Abstract



This work is devoted to practical joint source-channel coding. Although the
proposed approach has a more general scope, for the sake of clarity we focus on a
specific application example, namely, the transmission of digital images over noisy
binary-input output-symmetric channels. The basic building blocks of most
state-of-the-art source coders are: 1) a linear transformation; 2) scalar quantization of
the transform coefficients; 3) probability modeling of the sequence of quantization
indices; 4) an entropy coding stage. We identify the weakness of the conventional
separated source-channel coding approach in the catastrophic behavior of the
entropy coding stage. Hence, we replace this stage with linear coding, which maps
the sequence of redundant quantizer output symbols directly into a channel
codeword. We show that this approach does not entail any loss of optimality in the
asymptotic regime of large block length. However, in the practical regime of finite
block length and low decoding complexity, our approach yields very significant
improvements. Furthermore, our scheme retains the transform, quantization
and probability modeling of current state-of-the-art source coders, which are carefully
matched to the features of specific classes of sources. In our working example, we
make use of the "bit-plane" and "context" models defined by the JPEG2000 standard
and we re-interpret the underlying probability model as a sequence of conditionally
Markov sources. The Markov structure allows us to derive a simple successive
coding and decoding scheme, where the latter is based on iterative Belief Propagation.
We provide a construction example of the proposed scheme based on punctured
Turbo Codes and we demonstrate the gain over a conventional separated scheme by
running extensive numerical experiments on test images.
1 Introduction



Shannon's source-channel separation principle [1] states that, in the limit of large
block length and for a large class of communication setups, the optimal performance can be
approached by independently designing the source coding and the channel coding schemes.
Driven by the separation principle, modern communication systems have been developed
according to a rather rigid layered architecture [2]: the source coding functions are
essentially relegated to the application layer, at the top of the protocol stack, while the
channel coding functions are located at the link and physical layers, at the bottom of
the protocol stack. While a separated (layered) architecture has the unquestioned
advantage of modular system design, allowing the convergence of a great variety of services on
a common data network infrastructure^1, there are cases where a Joint Source-Channel
Coding (JSCC) approach is called for. On the one hand, there exist several relevant
multiterminal settings where the separated approach is known to be suboptimal [1]. On the
other hand, even in standard point-to-point channels where Separated Source-Channel
Coding (SSCC) is asymptotically optimal, the use of independently designed source and
channel codes may result in poor performance in the practical non-asymptotic regime of
finite block length and low-complexity encoding/decoding. This paper is concerned with
this second case.

As a typical example, consider the transmission of digital images on a wireless channel^2.
Present systems treat the compressed image as a data packet that must be delivered
error-free to the destination, despite the fact that, unlike other kinds of data,
an image may be represented at the destination within some tolerable distortion. In a
system based on separation, the distortion is introduced solely by source coding and
the underlying link and physical layers struggle to deliver the source-encoded bits
error-free, by using a combination of channel coding and retransmissions. Another typical
example is represented by terrestrial or satellite Digital Video Broadcasting [4]. Here
retransmissions cannot be used. Therefore, the channel coding scheme is designed to
achieve a very strict BER target of 10"^'^ or below. In both these examples, a very
demanding performance requirement is imposed on the physical layer. This is due to
the fact that the source coding scheme, which was designed assuming an error-free channel,
exhibits a catastrophic behavior with respect to channel errors: even a small fraction of bits
in error at the source decoder input produces a very large distortion of the reconstructed
source.

^1 The modern evolution of the Internet, offering telephone, video, multimedia streaming and
data on the same wired and wireless infrastructure, is the paramount example of this trend.

^2 This has become a killer application in cellular systems thanks to the widespread use of
camera-equipped mobile phones.

The purpose of this paper is to outline a new pragmatic approach to the design of JSCC
schemes. Instead of focusing on idealized sources, channels and distortion measures as
in most information-theoretic literature on JSCC (e.g., see [5] and references therein),
we start from the typical structure of state-of-the-art source coders, as outlined in [6].
Although the focus of this paper is on practical system design, our approach is based
on two key information-theoretic results: a) linear codes achieve the Shannon limit of
$C/H$ source symbols per channel use in order to transmit an arbitrary source with
sup-entropy rate $H$ over a symmetric memoryless channel with capacity $C$ (see Theorem 1
in the following); b) under quadratic distortion, the concatenation of (dithered) scalar
quantization with entropy coding achieves the rate-distortion function of a stationary
ergodic source within a bounded rate penalty [7], [8].

In our view, result (b) is the theoretical foundation of most state-of-the-art source
"transform" coders, which are based on a linear transformation in order to project the
source onto a convenient basis, followed by scalar quantization of the transform coefficients
and entropy coding of the quantization indices. The latter is based on a carefully designed
probability model matched to the class of sources to be encoded (see [6] and Section 2.1).
The catastrophic behavior mentioned above is due to the presence of the entropy coding stage,
which is typically implemented by using standard data compression algorithms such as
adaptive arithmetic coding or Huffman coding [6], [9], [1]. Fortunately, thanks to result
(a), the conventional entropy coding stage can be replaced by a linear non-catastrophic
encoder that maps the redundant symbol sequence output by the quantizer directly into
a channel codeword (see Section 2.3).

In order to illustrate our ideas, we use as a running example the transmission of digital
images over a Binary-Input Output-Symmetric (BIOS) channel and use JPEG2000 [9]
as our baseline source coder. Since the sequence of quantization indices is not binary
in general, we use a successive encoding and "onion-peeling" decoding architecture: the
chain rule decomposition of entropy ensures asymptotic optimality of this approach, which
has the advantage of making use of binary linear codes, whose design is well understood.
As argued in Section 6, our approach can be readily extended to a variety of channels and
sources.

As far as low-complexity decoding is concerned, in Section 3 we reduce the probability
model defined by JPEG2000 to a Markovian model that admits a very simple trellis
Factor Graph. This yields an efficient joint source-channel iterative decoding scheme
based on Belief Propagation (BP) [10], [11]. In Section 4 we discuss the design of the linear
coding stage. In this paper we consider the use of punctured Turbo Codes, although
other families of linear codes can be easily used instead. Numerical experiments that
demonstrate the advantage of the proposed scheme are presented in Section 5.

Brief literature survey. JSCC is a very vast subject and covering it is well beyond
the scope of this paper. Here, we focus only on classical and recent works directly related
to our approach.

The fact that linear codes with syndrome (linear) encoding achieve entropy for discrete
sources is well known (see [12] and references therein). This result was recently extended
to arbitrary sources (including non-stationary and non-ergodic ones) in [13]. We make use
of this result to prove our Theorem 1, which is indeed a simple corollary. The optimality
of linear codes for (almost) lossless fixed-to-fixed length source coding, together with the
advances in channel coding that followed the discovery of Turbo Codes [14] and the
re-discovery of LDPC codes [11], spurred an impressive amount of work aimed at using these
families of codes with low-complexity BP decoding for data compression (see for example
[15], [16], [17], [18], [19], [20], [21], [22], [23], [13]). While a channel coding approach
is needed for Slepian-Wolf separated coding of correlated sources, it is easy to verify that
it is not competitive with standard fixed-to-variable length entropy coding, both in terms
of complexity and in terms of performance, in the standard lossless data compression
setting. The only exception is, perhaps, the algorithm devised in [20], [21], [22], [23],
[13] based on closed-loop "doping", which allows the decoder to achieve zero error by
allowing for some small variability in the encoding length. However, when transmitting over
a noisy channel, closed-loop doping cannot be applied in a straightforward manner since the
encoder cannot replicate exactly the decoding process.

On the other hand, it is well known that linear codes with linear encoding are bounded
away from the rate-distortion function [24], [25], [26]. Linear codes in the lossy source
coding setting have been proposed as a structured way to construct the quantization codebook.
However, the encoder must be non-linear, and typically involves a high complexity.
Classical results on this topic are [27], [28], [29], [30], [31] and more recent results can
be found, for example, in [32], [33], [34].

We would like to stress the fact that our approach is very different from the above 
works: we do make use of linear codes and linear encoding. However, linear encoding is 
applied to the output of a scalar quantizer in order to map the (redundant) quantization 
indices onto channel symbols. The scheme is not limited by Ancheta's negative result 
[211 [26] because quantization is a non-linear operation. However, the scheme is also not 
limited by the lossless or almost lossless requirement because we are in a lossy setting. 
Furthermore, we take advantage of the very low complexity of scalar quantization and
linear encoding. In simple terms, we take the best of both non-linear and linear encoding
approaches without paying a high price in terms of performance.

Another set of relevant related works deals with JSCC for redundant data, with a BER
(Hamming distortion) performance criterion. This has been recently pursued, for example,
in [35], [36]. Our use of punctured Turbo Codes for the linear coding stage is largely
inspired by [36], with some differences that shall be pointed out in Section 4. At the
decoder we do make use of BP iterative decoding taking the source statistics into account.
This approach, generally referred to as joint source-channel decoding, or source-aided
decoding, is treated in a very large number of works, as for example [37], [38], [39], [40],
[41], [42], [43]. Several works
do not consider eliminating the entropy coding stage in the encoder as done here, but make 
use of an iterative decoder that exploits some structure of the entropy code (Huffman, as in 
[HIHS] or arithmetic, as in [IQ]) in order to mitigate the catastrophic effect of residual bit 
errors after channel decoding on the entropy code inverse (decompression). Unfortunately,
classical entropy coding algorithms do not lend themselves easily to soft-input soft-output
BP decoding. Therefore these iterative schemes typically have high complexity and often
unimpressive performance.

As a final remark, we would like to mention that several works considered a milder
form of JSCC based on the optimization of the error protection (redundancy) of
channel coding in order to optimize the end-to-end distortion performance of some standard
embedded source coding scheme. This idea, which is directly related to the concept of
unequal error protection, appears for example in [45], [46], [47], [48], [49], [50] and is
briefly discussed in Section 2.2. Our approach does not make use of
unequal error protection on the source data, with the exception of the model probability
parameters, which must be received error-free. This is similar to the high protection of the
header required by standard separated source-channel coding, and involves only a small
fraction, vanishing in the limit of large block length, of the overall source length.

2 Main ideas 

This section is devoted to the main ideas driving the proposed JSCC scheme. The
presentation is kept as general as possible. A more detailed presentation of the encoder and
decoder design for the specific example of digital images on BIOS channels is provided in
Sections 3 and 4, respectively.
2.1 A typical source transform coding architecture

We consider a general transform source coding architecture illustrated by the block
diagram of Fig. 1 (see [6] and references therein), inspired by JPEG2000 [9].

Figure 1: A JPEG2000-like source encoder. $\hat{\theta}$ denotes the estimated probability
model assumed by the entropy coding stage.



The source sequence $\mathbf{s} \in \mathbb{R}^K$ is transformed by the linear map
$\mathcal{W} : \mathbb{R}^K \to \mathbb{R}^K$ (e.g., a wavelet transform). The sequence of
transform coefficients $\mathbf{z} = (z_1, \ldots, z_K) = \mathcal{W}(\mathbf{s})$
is quantized by applying componentwise the scalar quantizers
$Q_k : \mathbb{R} \to \mathbb{F}_2^{P+1}$, for $k = 1, \ldots, K$, such that
$\mathbf{u}_k = Q_k(z_k)$^3. We denote by $P$ the number of quantization bits
used to represent the magnitude of sample $z_k$, and use one additional bit to represent the
sign of $z_k$. Notice that this binary representation of the quantization indices does not
involve any loss of generality. We let $\mathbf{u} = Q(\mathbf{z}) = (\mathbf{u}_1, \ldots, \mathbf{u}_K)$
denote the sequence of quantization indices, and let $\{u_{p,k} : p = 0, \ldots, P\}$ denote
the "bits" forming $\mathbf{u}_k$. We think of $\mathbf{u}$ as a two-dimensional
$(P+1) \times K$ binary array, where the rows $\mathbf{u}^{(p)} = (u_{p,1}, \ldots, u_{p,K})$,
for $p \in \{0, \ldots, P\}$, are referred to as "bit-planes". The bit-plane $p = 0$ contains
the sign bits, and the bit-planes $p = 1, \ldots, P$ contain the magnitude bits, where $p = 1$
corresponds to the least significant bit and $p = P$ corresponds to the most significant bit.
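The bit-plane array can be illustrated with a short sketch. It assumes that the quantization index of each coefficient is available as a signed integer and that a sign bit equal to 1 marks a negative value; both conventions, and the function names, are illustrative choices rather than part of the scheme described here.

```python
def to_bit_planes(indices, num_mag_bits):
    """Map signed integer quantization indices to a (P+1) x K bit array.

    Row 0 holds the sign bits (1 = negative, an assumed convention);
    rows 1..P hold the magnitude bits, row 1 being the least significant.
    """
    planes = [[0] * len(indices) for _ in range(num_mag_bits + 1)]
    for k, q in enumerate(indices):
        mag = abs(q)
        if mag >= 1 << num_mag_bits:
            raise ValueError("magnitude does not fit in P bits")
        planes[0][k] = 1 if q < 0 else 0
        for p in range(1, num_mag_bits + 1):
            planes[p][k] = (mag >> (p - 1)) & 1
    return planes


def from_bit_planes(planes):
    """Invert to_bit_planes (a negative zero is read back as 0)."""
    num_mag_bits = len(planes) - 1
    out = []
    for k in range(len(planes[0])):
        mag = sum(planes[p][k] << (p - 1) for p in range(1, num_mag_bits + 1))
        out.append(-mag if planes[0][k] else mag)
    return out
```

With `P = 3`, the indices `[3, -2, 0, 5]` give four rows of length four: one sign plane followed by three magnitude planes ordered from least to most significant bit.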

The sequence of quantization indices $\mathbf{u}$ is generally redundant. Therefore,
$\mathbf{u}$ must be further compressed by a combination of decimation and entropy coding.
By "decimation" we mean discarding some segments of the bit-planes: only the segments of
$\mathbf{u}$ that are not discarded are effectively entropy-encoded and contribute to the
encoder output $\mathbf{b} \in \mathbb{F}_2^B$. For simplicity of exposition, in this work we
assume that no decimation is performed, although the case of decimation can be easily handled.

The entropy encoder is based on a parametric probability model, targeted to the
specific class of sources (see [6] and references therein). In the sequel, we assume the
following probability model for our source: let
$\{P_\theta^{(K)}(\cdot) : \theta \in \Theta, K = 1, 2, \ldots\}$ denote a
family of processes, defined by the sequence of $K$-dimensional probability distributions^4
on $\mathbb{F}_2^{(P+1) \times K}$ parameterized by $\theta$. For the time being, we assume
that for each $\theta \in \Theta$, the corresponding process is stationary and ergodic, and
denote its entropy rate by $H_\theta(U)$.
This assumption shall be revisited in Section 3 when we discuss more specifically the
probability model underlying JPEG2000. Furthermore, we assume that the probability
model is matched, that is, $\mathbf{u} \sim P_\theta^{(K)}$ for some $\theta \in \Theta$.
A model mismatch would involve an additional rate penalty. However, discussing mismatch in
this context is rather pointless since the actual statistics of real-life sources such as
images are essentially unknown.

^3 The quantizers $Q_k$ may differ in their quantization regions and reconstruction points.
For example, in the scheme considered here the quantizers $Q_k$ have a different dynamic range
depending on which subband of the Discrete Wavelet Transform $\mathcal{W}$ the symbol $z_k$
belongs to.

The modeler estimates the probability parameter $\theta$ from the current realization of
$\mathbf{u}$, obtaining $\hat{\theta}$, and losslessly encodes the p-th bit-plane using about

$$B_p = -\log_2 P_{\hat{\theta}}^{(K)}\left(\mathbf{u}^{(p)} \mid \mathbf{u}^{(p+1)}, \ldots, \mathbf{u}^{(P)}\right) \qquad (1)$$

bits. As we will see in Section 3, in our working example the probability model is
conditionally Markov with a fixed state transition diagram, where the conditioning on the
p-th bit-plane is due to the upper bit-planes $p+1, \ldots, P$. Then, the model parameter
$\theta$ consists of the collection of transition matrices defining these Markov chains, the
elements of which can be easily Maximum-Likelihood (ML) estimated by counting the empirical
frequency of symbols corresponding to each state transition.
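The counting estimate and the resulting ideal code length can be sketched on a toy model. The sketch assumes a first-order binary Markov chain whose state is simply the previous bit; this is a deliberate simplification, not the conditionally Markov model discussed in Section 3, and the function names are hypothetical.

```python
import math


def ml_transition_probs(bits, init_state=0):
    """ML estimate of P(next bit | state) obtained by counting transitions.

    The state is the previous bit (illustrative assumption).
    """
    counts = [[0, 0], [0, 0]]
    state = init_state
    for b in bits:
        counts[state][b] += 1
        state = b
    probs = []
    for row in counts:
        total = sum(row)
        # A state that was never visited gets a uniform placeholder.
        probs.append([c / total if total else 0.5 for c in row])
    return probs


def ideal_code_length(bits, probs, init_state=0):
    """-log2 of the sequence probability under the fitted chain, in bits."""
    state, length = init_state, 0.0
    for b in bits:
        length -= math.log2(probs[state][b])
        state = b
    return length
```

For a sequence with long runs, the fitted chain assigns high probability to staying in the same state, so the ideal code length falls below the raw sequence length.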

In source coding schemes such as JPEG2000, the model parameter estimation and the
entropy coding are performed simultaneously, by sequential estimation of the state
transition probabilities with a Krichevsky-Trofimov (KT) probability estimator [51] together
with arithmetic coding [52]. In Fig. 1 we represent the probability model estimation
and the entropy coding as two separate blocks for conceptual simplicity and because in the
proposed scheme (see Section 2.3) these two functions are indeed performed separately.
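Sequential KT estimation can be sketched for a single state. The KT rule assigns probability (n_b + 1/2) / (n_0 + n_1 + 1) to the next bit being b, where n_0 and n_1 count the zeros and ones seen so far; the sketch below accumulates the ideal adaptive code length for one memoryless binary sequence. A context-based coder would keep one such pair of counters per state, which is omitted here; the function name is hypothetical.

```python
import math


def kt_code_length(bits):
    """Ideal adaptive code length (in bits) under the KT estimator."""
    counts = [0, 0]
    length = 0.0
    for b in bits:
        # KT sequential probability of the observed bit.
        p = (counts[b] + 0.5) / (counts[0] + counts[1] + 1.0)
        length -= math.log2(p)
        counts[b] += 1
    return length
```

The first bit always costs exactly one bit, and a long run of identical symbols costs far less than one bit per symbol once the counters have adapted.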

The output length of the source encoder is given by $B = \sum_{p=0}^{P} B_p$. It follows
that the source coding rate is given by

$$R_s = \frac{B}{K} \quad \text{bit/source sample}$$

This corresponds to some target distortion level. In the case of no decimation, distortion is
due solely to the quantization error. We shall refer to this distortion value, denoted by
$D_Q$, as the quantization distortion. Of course, a more flexible tradeoff between distortion
and rate can be achieved by considering decimation.

^4 In the interest of notational simplicity, in this paper we use the symbol $P$ to denote
probability with respect to the appropriately defined joint probability space. We do not
distinguish between the random variables and the arguments of the probability density or mass
function, since the distinction is clear from the context.
2.2 Separated approach and Shannon limit

We consider the problem of transmitting the source sequence $\mathbf{s}$ over a stationary
memoryless BIOS channel [11] with capacity $C$ bit/channel use.

In a classical SSCC approach, the compressed bit sequence $\mathbf{b}$ produced by the source
encoder is mapped into a channel codeword $\mathbf{x} \in \mathbb{F}_2^N$ by a channel encoder
of rate $R_c = B/N$. The resulting encoding efficiency $\eta$ (measured in source samples per
channel use) is given by

$$\eta = \frac{K}{N} = \frac{R_c}{R_s}$$

In the limit of large $K$, assuming stationarity and ergodicity of the source and an ML
probability estimator such that $\hat{\theta} \to \theta$, we have that $B/K \to H_\theta(U)$.
Furthermore, when $\Theta$ is an $M$-dimensional compact set, Rissanen's bound [53] ensures
that the model representation redundancy grows only as $\frac{M}{2} \log K$, and therefore
communicating the probability model parameter $\hat{\theta}$ as side information to the
decoder costs asymptotically a vanishing rate penalty (more details on the parameter
representation are given in Section 3).

From the above and Shannon's source and channel coding theorems [1] it follows
that, for arbitrary $\delta, \epsilon > 0$ and sufficiently large $K$, there exist pairs of
source and channel codes achieving efficiency

$$\eta > \frac{C}{H_\theta(U)} - \delta$$

with error probability $P(\hat{\mathbf{b}} \neq \mathbf{b}) < \epsilon$, where
$\hat{\mathbf{b}}$ denotes the channel-decoded sequence of information bits.

The point with coordinates $(C/H_\theta(U), D_Q)$ on the efficiency-distortion plane
corresponds to the best possible performance for a JSCC scheme with fixed quantizer $Q$ and
source model parameter $\theta$. This point shall be referred to as the Shannon limit for our
system. Notice that for a source with rate-distortion function $R(D)$, the best possible
efficiency is given by $C/R(D_Q) \geq C/H_\theta(U)$. Nevertheless, for complicated sources
such as images the rate-distortion function is not generally known. Furthermore, scalar
quantization followed by entropy coding is near-optimal over a wide range of rates for a wide
class of sources [54], [7], [8], [55], [56]. Following the pragmatic approach advocated in
[6], we say that the rate-distortion point $(H_\theta(U), D_Q)$ is a point on the source
encoder operational rate-distortion curve.
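A small numerical illustration of the Shannon limit $C/H_\theta(U)$ follows. It assumes a binary symmetric channel, which is one particular BIOS channel, and treats the entropy rate as a given number of bits per source sample; the function names are illustrative.

```python
import math


def binary_entropy(p):
    """Binary entropy function h(p) in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def bsc_capacity(crossover):
    """Capacity of a binary symmetric channel, in bit/channel use."""
    return 1.0 - binary_entropy(crossover)


def shannon_limit_efficiency(capacity, entropy_rate):
    """C / H, in source samples per channel use."""
    return capacity / entropy_rate
```

For example, a crossover probability of 0.11 gives a capacity close to 0.5 bit/channel use, so a source with entropy rate 0.5 bit/sample could at best be sent at roughly one source sample per channel use.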

As already argued in Section 1, it is well known that for practical (finite) source block
lengths and channel encoding/decoding complexity the SSCC scheme obtained by the
concatenation of the transform coder described in Section 2.1 with a channel code might
perform quite poorly. Entropy coding has a catastrophic behavior: its
inverse function is ill-conditioned. A small number of bit errors in the channel-decoded bit
sequence $\hat{\mathbf{b}}$ generates a large number of symbol errors in the entropy-decoded
quantization index sequence $\hat{\mathbf{u}}$, and eventually a large distortion in the
reconstructed sequence $\hat{\mathbf{s}} = \mathcal{W}^{-1}(Q^{-1}(\hat{\mathbf{u}}))$^5.

The catastrophic behavior of the source encoder can be partially mitigated by imposing 
an embedded structure. An encoder is said to be embedded if for any B' < B, the output 
sequence b' produced for output length B' is the prefix of sequence b produced for output 
length B. With an embedded source encoder, all the bits in b from the beginning to 
the first occurred bit-error can be used for reconstruction, while all the rest must be 
discarded. Based on this idea, several works (see for example [HI HHl HHl HH HHl SHI 150] ) 
have addressed the optimization of channel coding redundancy in order to maximize the 
average number of correctly received bits or minimize the average distortion before the 
occurrence of the first bit-error. In this paper we take a different approach, outlined in 
the next section. 



2.3 The proposed approach 

Instead of mitigating the catastrophic behavior of the entropy coding stage, we avoid it
by taking a JSCC approach: we merge entropy coding and channel coding into a single
non-catastrophic encoding operation that maps the redundant sequence $\mathbf{u}$ directly
into the channel codeword $\mathbf{x}$ (see Fig. 2).

Figure 2: Proposed joint source-channel coding scheme. The estimated probability model
parameters $\hat{\theta}$ are separately transmitted as side information.



Since linear codes achieve the capacity of BIOS channels [57], we shall consider a linear
map $\mathbf{u} \mapsto \mathbf{x}$. Furthermore, since binary linear codes are particularly
simple and well understood, we shall implement this linear mapping in layers, bit-plane by
bit-plane. In particular, we consider $P+1$ linear codes $\mathcal{C}_0, \ldots, \mathcal{C}_P$
with block lengths $N_0, \ldots, N_P$ and generator matrices
$\mathbf{G}_0, \ldots, \mathbf{G}_P$. We obtain the codeword $\mathbf{x}$ as the concatenation
of $\mathbf{x}^{(0)}, \ldots, \mathbf{x}^{(P)}$, where

$$\mathbf{x}^{(p)} = \mathbf{u}^{(p)} \mathbf{G}_p \qquad (4)$$

^5 With some abuse of notation, we denote by $Q^{-1}$ the reconstruction mapping of the
quantizer, i.e., $Q^{-1}(\mathbf{u})$ denotes the representation point of the quantization bin
indexed by $\mathbf{u}$.

Three questions naturally arise at this point: 1) Suppose that for each $\theta$ we can pick
the best encoding matrices: can we approach the Shannon limit? 2) Is it necessary to
fine-tune the encoding matrices $\{\mathbf{G}_p\}$ for each set of source parameters
$\theta \in \Theta$? 3) Can we find a low-complexity decoder for the JSCC scheme?

Questions 1 and 2 are addressed simultaneously by Theorem 1 below and by the
comment that follows. Question 3 is addressed in Section 4, where the Markov structure of
the source probability model is exploited in conjunction with the structure of the binary
linear codes in order to obtain a low-complexity iterative joint source-channel decoder
based on BP [10], [11], [14].

The asymptotic goodness of our scheme is supported by the following result, which is
an immediate corollary of the optimality of linear codes both for lossless compression
[57], [22], [13] and for achieving the capacity of BIOS channels (see [57] and references
therein), and of the fact that the concatenation of two linear maps is a linear map.

Theorem 1. Consider a binary source $V$ defined by the sequence of $K$-dimensional
joint probability distributions $\{P_V^{(K)}(\mathbf{v}) : K = 1, 2, \ldots\}$ over
$\mathbb{F}_2^K$. Define the sup-entropy rate $H(V)$ of $V$ as [58], [59] the limsup in
probability of the sequence of random variables

$$-\frac{1}{K} \log_2 P_V^{(K)}(\mathbf{V})$$

that is, the infimum of all values $h$ for which

$$P\left(-\frac{1}{K} \log_2 P_V^{(K)}(\mathbf{V}) > h\right) \to 0 \qquad (5)$$

as $K \to \infty$. Consider a system that, for each length $K$, maps source sequences
$\mathbf{v}$ into binary codewords $\mathbf{c} = \mathbf{v}\mathbf{G}$ of length $N$, and
transmits $\mathbf{c}$ over a BIOS channel with capacity $C$. Let $\mathbf{y}$ denote the
channel output and $\psi : \mathbf{y} \mapsto \hat{\mathbf{v}}$ be a suitable decoder.

Then, for any $\delta, \epsilon > 0$ and sufficiently large $K$ there exists a $K \times N$
matrix $\mathbf{G}$ and a decoder $\psi$ such that
$P(\hat{\mathbf{v}} \neq \mathbf{v}) < \epsilon$ and $K/N > C/H(V) - \delta$.

Proof. See the Appendix. □
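The layered encoding, one binary linear map per bit-plane followed by concatenation, can be sketched as below. The generator matrices are drawn at random purely as placeholders; an actual design would use structured codes such as the punctured Turbo Codes considered in this paper, and all function names are hypothetical.

```python
import random


def gf2_encode(u, G):
    """Row vector times matrix over GF(2): x = u G."""
    x = [0] * len(G[0])
    for bit, row in zip(u, G):
        if bit:
            x = [a ^ b for a, b in zip(x, row)]
    return x


def encode_bit_planes(planes, generators):
    """Concatenate x^(p) = u^(p) G_p over all bit-planes p = 0..P."""
    codeword = []
    for u_p, G_p in zip(planes, generators):
        codeword.extend(gf2_encode(u_p, G_p))
    return codeword


def random_generator(k, n, rng):
    """Placeholder for a designed code: a random k x n binary matrix."""
    return [[rng.randint(0, 1) for _ in range(n)] for _ in range(k)]
```

Because each layer is linear over GF(2), encoding the XOR of two bit-planes gives the XOR of their codewords, and different layers can use different block lengths.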
We argue that Theorem 1 has an important consequence for the joint source-channel
coding of a family of sources indexed by $\Theta$ and a fixed BIOS channel. In fact, for each
value $H$ there exists one sequence of encoding matrices of increasing block length $K$ and
efficiency arbitrarily close to $C/H$ such that the decoding error probability vanishes for
all the source statistics with parameters $\theta \in \Theta$ such that $H_\theta(U) = H$.
This fact is seen by considering a mixed source [58] $V$ defined by the family of
distributions $\{P_\theta : \theta \in \mathcal{H}\}$, where $\mathcal{H} \subset \Theta$ is
the set of parameters for which $H_\theta(U) = H$ and where $\theta$ has a uniform
prior probability over $\mathcal{H}$. It is easily seen that in this case $H(V) = H$.
Therefore, by Theorem 1, there exists a sequence of coding matrices of increasing block
length such that $P_\theta(\hat{\mathbf{v}} \neq \mathbf{v}) < \epsilon$ and
$K/N > C/H - \delta$.

As far as implementation is concerned, this implies that we need only design one
set of coding matrices $\{\mathbf{G}_p\}$ for each value $H$ of the source entropy rate. If
"tuning" of the codes were needed for each particular $\theta \in \Theta$, the scheme would be
impractical. In fact, this would require a very large set of codes that differ not only in
their rate, but also in their structure (e.g., in their generator polynomials in the case of
TCs, or degree distributions in the case of LDPC codes). On the contrary, families of codes
with different rates can be obtained by progressive puncturing of a single "mother" code, in
a very convenient way for implementation. This is in fact the approach taken in the code
design of Section 4: we define a fine quantization grid on the interval [0, 1] of possible
values of the bit-plane empirical entropy rate and design a coding matrix $\mathbf{G}_p$ for
each quantized rate value. Then, when encoding the p-th bit-plane, we compute the conditional
empirical entropy rate

$$\hat{H}(U_p \mid U_{p+1}, \ldots, U_P) = -\frac{1}{K} \log_2 P_{\hat{\theta}}^{(K)}\left(\mathbf{u}^{(p)} \mid \mathbf{u}^{(p+1)}, \ldots, \mathbf{u}^{(P)}\right) \qquad (6)$$

and choose the corresponding (pre-designed) encoding matrix.
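The selection of a pre-designed code from the rate grid can be sketched as follows. The uniform grid with `grid_size` steps and the choice of rounding the empirical entropy rate up to the next grid point are assumptions made for illustration; the text only specifies that the grid is fine and that one code is designed per grid value.

```python
import math


def pick_grid_rate(empirical_entropy, grid_size):
    """Smallest grid point j / grid_size not below the empirical entropy rate."""
    if not 0.0 <= empirical_entropy <= 1.0:
        raise ValueError("a bit-plane entropy rate must lie in [0, 1]")
    # The small offset guards against floating-point noise at exact grid points.
    j = math.ceil(empirical_entropy * grid_size - 1e-12)
    return j / grid_size
```

Rounding up keeps the selected design value from falling below the measured entropy rate, at the cost of at most one grid step of extra redundancy.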

A similar approach is followed in [20], [21], [22], [13], where linear block codes are used
for universal lossless compression. Another possibility consists of using rateless codes
[60], [61], [22], in order to be able to generate an arbitrary amount of coded symbols and
therefore a continuum of rates. This approach was followed in the context of universal
lossless compression in [23] and is briefly discussed in Section 6 as a possible extension
of the present work.

For an arbitrarily fine rate quantization grid, we let the codeword block length of the
p-th bit-plane be given by

$$N_p = K \left( \frac{\hat{H}(U_p \mid U_{p+1}, \ldots, U_P)}{C} + \delta_p \right)$$

where $\delta_p > 0$ is a small rate margin. The achieved coding efficiency is given by



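$$\eta = \frac{K}{\sum_{p=0}^{P} N_p} = \frac{C}{\hat{H}(U)} - \delta \qquad (7)$$

where $\delta$ is a small positive quantity that vanishes as all $\delta_p \to 0$ and

$$\hat{H}(U) = -\frac{1}{K} \log_2 P_{\hat{\theta}}^{(K)}(\mathbf{u})$$

is the empirical entropy rate of $\mathbf{u}$. Since $\hat{H}(U) \to H_\theta(U)$, it follows
that for asymptotically large $K$ the proposed method approaches the operational Shannon
limit defined in Section 2.2.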

The advantage of JSCC over SSCC becomes clear in the non-asymptotic regime of
moderate $K$ and practical low-complexity channel coding and decoding. In fact, the
design of non-catastrophic linear encoders is a well-understood topic
in coding theory [62]. In particular, powerful channel coding families such as Turbo Codes
(TC) [14], Low-Density Parity-Check (LDPC) codes [11], [63], [64], [65] and Irregular
Repeat-Accumulate (IRA) codes [66], [67] can easily achieve post-decoding BER between 10~^ and
10~^ provided that their rate is below a certain threshold, which is close to
$C/H_\theta(U)$ even for moderate information block length $K$. While, as said before, such
values of BER would produce very large distortion in the presence of the entropy coding
stage, in the proposed system the BER directly affects the bit-plane components $u_{p,k}$.
It is clear that a few bits in error in the bit-planes yield a small output distortion,
since the inverse transform $\mathcal{W}^{-1}$ is linear, unitary or close to unitary, and
hence well-conditioned.
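The well-conditioning argument can be checked numerically on a toy transform. The sketch below uses a one-level orthonormal Haar transform as a stand-in for the wavelet transform (an assumption made for brevity) and verifies that an error injected into one transform coefficient produces exactly the same squared error after the inverse transform.

```python
import math


def haar_forward(s):
    """One-level orthonormal Haar transform of an even-length sequence."""
    r = math.sqrt(0.5)
    low = [r * (s[i] + s[i + 1]) for i in range(0, len(s), 2)]
    high = [r * (s[i] - s[i + 1]) for i in range(0, len(s), 2)]
    return low + high


def haar_inverse(z):
    """Inverse of haar_forward."""
    r = math.sqrt(0.5)
    half = len(z) // 2
    s = []
    for lo, hi in zip(z[:half], z[half:]):
        s.extend([r * (lo + hi), r * (lo - hi)])
    return s


def squared_error(a, b):
    """Sum of squared differences between two sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))
```

Since the transform is orthonormal, the error energy is preserved rather than amplified, which is the behavior an entropy decoder lacks.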

As we will see in Section 4, the proposed decoder based on BP [10], [11], [14] efficiently
computes an accurate approximation of the symbol-by-symbol posterior probabilities

$$\{P(u_{p,k} \mid \mathbf{y}) : p = 0, \ldots, P, \; k = 1, \ldots, K\} \qquad (8)$$

where $\mathbf{y}$ is the channel output corresponding to the transmission of $\mathbf{x}$.

As far as the source reconstruction is concerned, we let (without loss of generality)
the k-th scalar quantizer map $\mathbf{u}_k = Q_k(z)$ be
where the parameter $\Delta_k$ determines the dynamic range of the quantizer for the k-th
transform coefficient. The corresponding reconstruction function is given by

$$Q_k^{-1}(\mathbf{u}_k) = (-1)^{u_{0,k}} \Delta_k \sum_{p=1}^{P} u_{p,k} 2^{p-1} \qquad (10)$$

Given the symbol-by-symbol posterior probabilities (8) produced by the BP decoder,
both hard and soft reconstruction are possible. In the first case, symbol-by-symbol hard
decisions $\hat{u}_{p,k} = \arg\max_{v \in \mathbb{F}_2} P(u_{p,k} = v \mid \mathbf{y})$ are
used in (10) to generate an estimate $\hat{z}_k$ of the k-th transform coefficient. In the
second case, the MMSE (conditional mean) estimator $\hat{z}_k = E[z_k \mid \mathbf{y}]$ is
computed. Assuming zero-mean quantization noise statistically independent of the channel
output $\mathbf{y}$, this takes on the appealingly simple form^6

$$\hat{z}_k = \Delta_k \tanh\left(\frac{\Lambda_{0,k}}{2}\right) \sum_{p=1}^{P} \frac{2^{p-1}}{1 + e^{\Lambda_{p,k}}} \qquad (11)$$

where we define the a posteriori log-likelihood ratio (LLR) for symbol $u_{p,k}$ as

$$\Lambda_{p,k} = \log \frac{P(u_{p,k} = 0 \mid \mathbf{y})}{P(u_{p,k} = 1 \mid \mathbf{y})} \qquad (12)$$

After producing the sequence $\hat{\mathbf{z}} = (\hat{z}_1, \ldots, \hat{z}_K)$, the source
is reconstructed by applying the inverse linear transform
$\hat{\mathbf{s}} = \mathcal{W}^{-1}(\hat{\mathbf{z}})$. The soft reconstruction approach
defined in (11) is sometimes referred to as "soft-bit" reconstruction in the literature
[68], [69].
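The two reconstruction rules can be sketched for a single coefficient. This is a minimal illustration under stated assumptions: magnitude bit p is weighted by 2^(p-1) times a step size `step` standing in for the quantizer parameter, a sign bit equal to 1 means a negative value, the LLR is log P(bit = 0 | y) / P(bit = 1 | y), and the bit posteriors are combined as if they were independent. The function names are hypothetical.

```python
import math


def prob_one(llr):
    """P(bit = 1 | y) for an LLR defined as log P(0 | y) / P(1 | y)."""
    return 0.5 * (1.0 - math.tanh(llr / 2.0))


def hard_reconstruct(llrs, step):
    """llrs[0] is the sign-bit LLR, llrs[p] the LLR of magnitude bit p."""
    bits = [1 if llr < 0 else 0 for llr in llrs]
    mag = sum(bits[p] * 2 ** (p - 1) for p in range(1, len(bits)))
    return (-1) ** bits[0] * step * mag


def soft_reconstruct(llrs, step):
    """Conditional-mean ("soft-bit") estimate from the bit LLRs."""
    sign_mean = math.tanh(llrs[0] / 2.0)  # E[(-1)^sign bit | y]
    mag_mean = sum(prob_one(llrs[p]) * 2 ** (p - 1) for p in range(1, len(llrs)))
    return sign_mean * step * mag_mean
```

With very reliable LLRs the two estimates coincide, while with uninformative LLRs the soft estimate shrinks toward zero instead of committing to an arbitrary value.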



3 Probability model, estimation and lossless compression

From this section to the end of the paper we illustrate in more detail an implementation
example based on JPEG2000. Despite the loss of generality, by developing this example
we hope to corroborate the claims made in Section 2.3 and gain in clarity. Generalizations
are discussed in Section 6.

^6 Of course, this is only an approximation if the BP decoder produces approximations of the
true symbol-by-symbol a posteriori probabilities.

In JPEG2000 the linear transform $\mathcal{W}$ in the block diagram of Fig. 1 is a Discrete
Wavelet Transform (DWT) [70], [71], [72]. This determines the subband structure shown in
Fig. 3. Let us consider a square gray-scale image of dimension $n \times n$ pixels. After $D$
stages of DWT, the transform coefficients are partitioned into $3D + 1$ square subbands
(named $LL_D$ and $HL_d$, $LH_d$ and $HH_d$, for $d = 1, \ldots, D$) of dimensions
$\frac{n}{2^d} \times \frac{n}{2^d}$, where $d$ is the decomposition step and where "L" and
"H" stand for Low and High frequency components, respectively.



[Figure 3 layout: subband labels LL3, HL3, LH3, HL2, LH2, HH2, LH1]

Figure 3: Subband decomposition in the DWT transform with D = 3 levels.
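The subband bookkeeping described above can be sketched in a few lines of Python. This is only an illustration under the stated assumptions (square n x n image, n divisible by $2^D$); the function name and the dictionary keys are ours.

```python
def subband_sizes(n, D):
    """Map subband name -> side length for an n x n image after D DWT stages."""
    if n % (2 ** D) != 0:
        raise ValueError("n must be divisible by 2**D")
    # Single low-pass subband at the deepest decomposition level.
    sizes = {f"LL{D}": n // 2 ** D}
    # Three detail subbands per decomposition level d = 1, ..., D.
    for d in range(1, D + 1):
        for name in ("HL", "LH", "HH"):
            sizes[f"{name}{d}"] = n // 2 ** d
    return sizes
```

For n = 512 and D = 3 this yields 3D + 1 = 10 subbands whose areas add up to the 512 x 512 coefficients of the transform.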



The transform coefficients $z_k$ are quantized as in (9). The parameter $\Delta_k$ of the k-th
quantizer is set to a fixed value for all positions k in the same subband, and depends
on the average energy content of the subband [9]. These parameters are sent separately
to the decoder for reconstruction. Each quantized subband is partitioned into square
blocks called "code-blocks", that are independently entropy-encoded. The typical size of
code-blocks is 32 x 32 or 64 x 64. This partitioning is done for the sake of decimation
(discarding some code-blocks) and in order to avoid error propagation across code-blocks
at the reconstruction side: if an error occurs in the channel-decoded stream, the error
propagation will be limited inside a code-block. The probability model is estimated locally
on each code-block, thus allowing a better matching of the model used for entropy coding
with the local statistics of the quantization indices. In JPEG2000 this is obtained by
resetting the KT probability estimator at the beginning of each code-block. Since in our
system we do not have such error propagation problems and, for simplicity, we do not
consider decimation, we shall not use a rigid partitioning into code-blocks. Nevertheless,
we shall consider the matching of the probability model parameters to the local statistics,
depending on the bit-plane index p and subband index d. The binary data corresponding
to the bit-plane/subband index pair (p, d) will be referred to in the following as the (p, d)
"segment".
Figure 4: Stripe-oriented scanning of the quantization indices in order to induce a
one-dimensional ordering.

The two-dimensional sequence of quantization indices is arranged into a one-dimensional
"temporal" ordering according to the so-called "stripe-oriented" scanning scheme illustrated
in Fig. 4. Each stripe is formed by four rows of quantization indices [9] and a number
of columns equal to the dimension of the subband. Then, each quantized coefficient $\mathbf{u}_k$ is
identified uniquely by its position k in the resulting one-dimensional arrangement $\mathbf{u}$.
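A minimal sketch of the stripe-oriented ordering is given below. The text above only states that stripes are four rows high; the order inside a stripe (column by column, each column scanned top to bottom) is our assumption, taken from the usual JPEG2000 bit-plane coder convention. The function name is hypothetical.

```python
def stripe_scan(height, width, stripe_height=4):
    """Return the list of (row, col) positions in stripe-oriented order.

    Assumed convention: stripes are visited top to bottom; inside a stripe
    the columns are visited left to right, each column from top to bottom.
    """
    order = []
    for top in range(0, height, stripe_height):
        for col in range(width):
            for row in range(top, min(top + stripe_height, height)):
                order.append((row, col))
    return order
```

The index k of a coefficient is then simply its position in the returned list.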

JPEG2000 models each bit-plane $\mathbf{u}^{(p)}$ as a binary correlated source, where the probability
distribution of a bit $u_{p,k}$ depends on the value of certain neighboring bits in the
same plane and on certain bits in the upper bit-planes p+1, ..., P. Without entering into
the fine details of the scheme, which are extremely tedious and can be learned from standard
references [9], we concentrate on the qualitative features of the probability model.

The local dependency of bit $u_{p,k}$ on its neighbors can be formulated as a Markov chain,
conditioned on the realization of the symbols in the upper bit-planes $\mathbf{u}^{(p+1)}, \ldots, \mathbf{u}^{(P)}$.
Consider the conditional probability distribution

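$$P^{(p)}(\mathbf{u}^{(p)} \mid \mathbf{u}^{(p+1)}, \ldots, \mathbf{u}^{(P)}) = \prod_{k=1}^{K} P^{(p)}(u_{p,k} \mid u_{p,1}, \ldots, u_{p,k-1}, \mathbf{u}^{(p+1)}, \ldots, \mathbf{u}^{(P)}) \qquad (13)$$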

The underlying conditional Markov model for the p-th bit-plane is represented by the
block diagram of Fig. 5.

Figure 5: Markov model underlying the conditional bit-plane joint distribution

The Markov chain state $\pi_{p,k} = (u_{p,k-1}, \ldots, u_{p,k-L})$ is formed by the content of a causal
sliding window of previous bits in the same bit-plane (the content of the shift-register of
Fig. 5), for some integer L. Furthermore, the dependency of bit $u_{p,k}$ on the bits in the
upper bit-planes is also confined to a local collection of index pairs $(p', k') \in S_{p,k}$. In other
words, the local dependency set $S_{p,k}$ is defined as the set of index pairs (p', k') for p' > p
such that

$$\{u_{p'',k''} : (p'',k'') \notin S_{p,k},\ p'' > p\} \to \{\pi_{p,k}, \{u_{p',k'} : (p',k') \in S_{p,k}\}\} \to u_{p,k}$$

is a Markov chain. Obviously, for the top bit-plane we have $S_{P,k} = \emptyset$ for all k.

The Markov model of Fig. 5 has a fixed state diagram structure, and it is parameterized
by the transition probabilities

$$P(u_{p,k} \mid \pi_{p,k}, \{u_{p',k'} : (p',k') \in S_{p,k}\}) = P^{(p)}(u_{p,k} \mid u_{p,1}, \ldots, u_{p,k-1}, \mathbf{u}^{(p+1)}, \ldots, \mathbf{u}^{(P)}) \qquad (14)$$

Fortunately, these probabilities take on distinct values only for a small number of equivalent
configurations of the conditioning bits. In the JPEG2000 parlance, we say that the
conditioning bits $(\pi_{p,k}, \{u_{p',k'} : (p',k') \in S_{p,k}\})$ define the context of bit $u_{p,k}$. Despite the
fact that we have $2^{L+|S_{p,k}|}$ possible configurations, many of them are equivalent. We define
the context function at bit-plane p as $\mathcal{K}_p : \mathbb{F}_2^{L+|S_{p,k}|} \to \{0, \ldots, M-1\}$, for some integer
M. We say that two configurations are equivalent if their image under $\mathcal{K}_p$ (the associated
"context") is the same. Then, for $\kappa \in \{0, \ldots, M-1\}$ and $v \in \{0, 1\}$ we have

$$P(u_{p,k} = v \mid \pi_{p,k}, \{u_{p',k'} : (p',k') \in S_{p,k}\}) = P_\theta(u_{p,k} = v \mid \kappa)$$

for all configurations of the conditioning bits such that $\mathcal{K}_p(\pi_{p,k}, \{u_{p',k'} : (p',k') \in S_{p,k}\}) = \kappa$.
In JPEG2000 we have M = 17, where 12 contexts are used for the magnitude bit-planes
and 5 contexts are used for the sign bit-plane. More details about the Markov state
diagram structure and on context definition in the notation of this paper (which is rather
different from the current JPEG2000 descriptions available in the literature [H]) are available
in [73]. Fig. 6 shows the trellis diagram corresponding to the Markov chain of Fig. 5 for
the top (P-th) bit-plane. We have L = 5, which yields a 32-state trellis. The states are
enumerated from 0 to 31; the label next to each state contains the state number and
the corresponding context $\kappa$. There are four non-equivalent trellis sections with the same
state transition structure (defined by the shift-register, see footnote (11) in Section 4) but
different correspondence between state and context value, which depends on the position of
the bit in the stripe. As seen from Fig. 4, there are four different positions corresponding to
the four rows forming the stripe. Here, trellis sections (a) to (d) correspond to the
four positions from top to bottom. Each state has two outgoing transitions corresponding
to $u_{p,k}$ being 0 (solid) or 1 (dashed). For example, for state 26 in section (a), the solid
transition corresponds to $u_{p,k} = 0$ with probability $P_\theta(0|5)$ and the dashed transition
corresponds to $u_{p,k} = 1$ with probability $1 - P_\theta(0|5)$. We showed the P-th bit-plane for
the LL and LH subbands. The state-context correspondence for the other subbands is
different, and it is determined as explained in [73]. For the p-th bit-planes with p < P
the correspondence depends on the value of the conditioning bits in the set $S_{p,k}$ and then
it varies with the time index k. It would therefore be very cumbersome to represent these
trellises in this paper.

At this point, it should be clear that the Markov model is completely determined by
the LLRs

$$\nu_{p,k}(\kappa) = \log \frac{P_\theta(u_{p,k} = 0 \mid \kappa)}{P_\theta(u_{p,k} = 1 \mid \kappa)}, \qquad \kappa = 0, \ldots, M-1 \qquad (15)$$

In our scheme, we use a non-sequential ML estimator for the above probabilities, that is
easily obtained by computing the empirical frequency of zeros for each context $\kappa$, where
separate bit-counts are maintained for each data block with homogeneous local statistics
as explained in the following.
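The non-sequential ML estimator can be sketched as follows for a single block of data with homogeneous statistics. The function name, the argument layout and the additive constant `eps` are our assumptions: `eps` is a small smoothing term that we add only to avoid a division by zero or the logarithm of zero for contexts that are never (or always) zero, and setting `eps=0` gives the plain empirical-frequency estimate when every context has both zeros and ones.

```python
import math

def estimate_context_llrs(bits, contexts, num_contexts, eps=0.5):
    """Per-context LLR log(P(0|ctx) / P(1|ctx)) from empirical bit counts.

    bits[k] is the k-th bit of the block and contexts[k] its context index.
    eps is an assumed additive smoothing constant (not part of the scheme).
    """
    zeros = [0] * num_contexts
    ones = [0] * num_contexts
    for bit, ctx in zip(bits, contexts):
        if bit:
            ones[ctx] += 1
        else:
            zeros[ctx] += 1
    return [math.log((zeros[c] + eps) / (ones[c] + eps))
            for c in range(num_contexts)]
```

Since the counts are accumulated over the whole block before any probability is formed, the estimator needs no sequential update rule, in contrast with the adaptive estimator of JPEG2000.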

As said before, the Markov model must be matched to the local statistics of $\mathbf{u}$, which
is in general non-stationary. Different probability estimates are computed for each (p, d)-th
segment. Furthermore, we found it convenient to estimate local statistics for
groups of adjacent stripes, generally smaller than a whole segment.
Figure 6: Trellis diagram for the Markov model of the P-th bit-plane: the Markov chain
is time-variant, and the four sections (a), (b), (c) and (d) correspond to the four different
positions of a bit in the stripe (see Fig. 4).
Suppose that segment (p, d) is partitioned into $m_{p,d}$ regions (groups of adjacent stripes)
and over each such region the model transition probabilities are locally estimated. For
all positions k in the same region, the LLRs defined in (15) take on the same value, which
depends only on the region and not on the time index k. We denote such parameters by
$\nu(p, d, \ell, \kappa)$, where p denotes the bit-plane, d denotes the subband, $\ell = 1, \ldots, m_{p,d}$ denotes
the region in the (p, d) segment and $\kappa$ is the context. Then, $\nu_{p,k}(\kappa) = \nu(p, d, \ell, \kappa)$ for all
positions k corresponding to the $\ell$-th region of the (p, d)-th segment. It follows that the
overall Markov model is piecewise stationary, and it is defined by

$$M_0 \sum_{d=1}^{3D+1} m_{0,d} + M_1 \sum_{d=1}^{3D+1} \sum_{p=1}^{P} m_{p,d} \qquad (16)$$

real parameters, where $M_1$ denotes the number of distinct contexts for the magnitude
bit-planes and $M_0$ denotes the number of distinct contexts for the sign bit-plane, with
$M_0 + M_1 = M$. The model parameter $\theta$ coincides with the collection of all the LLRs
$\{\nu(p, d, \ell, \kappa)\}$ defined above.

The estimated parameters $\theta$ must be sent to the decoder separately and must be
highly protected against channel errors, since the decoder needs the probability model for 
reconstruction (see Section H]). In the next section we shall discuss the compression-only 
performance of the scheme based on the above defined probability model. We discuss 
the optimization of the model parameter representation length and, as a sanity check, 
we compare the compression-only performance obtained by our Markovian probability 
modeler followed by arithmetic coding with that obtained by the standard JPEG2000. 

3.1 Compression-only performance 

If the probability model and estimator illustrated before are used for compression only,
the resulting output length (in bits) is given by $B_{tot} = B + B_{model}$, where $B = \sum_{p=0}^{P} B_p$
with $B_p$ given in (1) is the number of bits necessary to represent the bit-planes using the
estimated probability model (this can be essentially achieved by using arithmetic coding
based on the estimated probability $P_\theta(\mathbf{u})$), and where $B_{model}$ is the model redundancy,
i.e., the number of bits necessary to represent the estimated probability model parameter
$\theta$.

Thanks to the energy packing property of the DWT transform, the higher bit-planes
are not identically zero only for "low frequency" subbands. We do not encode the
identically zero subbands; instead, a one-bit flag is added to the model parameter to notify
the receiver about all-zero subbands.
In order to minimize the total output length, $B_{model}$ must be optimized. In particular,
we have to choose the number of regions $m_{p,d}$ into which each segment is partitioned, and
the number of bits $q_{p,d}$ for the description of each LLR parameter. In this regard, we have
investigated a few possibilities. One option consists of defining regions as groups of $N_s$
adjacent stripes, where $N_s$ is the same for all segments. Another option consists of having
different grouping values $n_{p,d}$ in each segment (p, d). In this case, the values $n_{p,d}$ for each
segment must be added to the model description. As for the quantization bits for parameter
representation, one option consists of using Rissanen's description length bound [53]. The
achievability part of Theorem 1 in [53] suggests representing each parameter $\nu(p, d, \ell, \kappa)$
by $\lceil \frac{1}{2} \log_2 K_{p,d} \rceil$ bits, where $K_{p,d}$ denotes the length of the groups in the (p, d) segment.
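The trade-off between data bits and model bits can be sketched as follows for one segment. This is a simplified illustration under our own assumptions: the data cost is the ideal code length computed from unquantized empirical probabilities (so the effect of parameter quantization on the data bits is ignored), regions are measured in symbols rather than in stripes, and a short last region is charged according to its own length. All names are hypothetical.

```python
import math

def region_cost_bits(bits, contexts, num_contexts):
    """Ideal data bits of one region plus a Rissanen-style model cost."""
    counts = {}
    for bit, ctx in zip(bits, contexts):
        counts.setdefault(ctx, [0, 0])[bit] += 1
    # Empirical conditional entropy of the region, in bits.
    data_bits = 0.0
    for n0, n1 in counts.values():
        for n in (n0, n1):
            if n:
                data_bits -= n * math.log2(n / (n0 + n1))
    # ceil(0.5 * log2(region length)) bits for each context parameter.
    length = len(bits)
    param_bits = math.ceil(0.5 * math.log2(length)) if length > 1 else 0
    return data_bits + num_contexts * param_bits

def segment_cost_bits(bits, contexts, num_contexts, region_len):
    """Total description length when the segment is cut into regions."""
    total = 0.0
    for start in range(0, len(bits), region_len):
        stop = start + region_len
        total += region_cost_bits(bits[start:stop], contexts[start:stop],
                                  num_contexts)
    return total
```

Evaluating `segment_cost_bits` for a handful of region lengths and keeping the smallest value mimics, in a crude way, the optimization over the stripe grouping discussed in the text.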

Since Rissanen's bound holds on average and it is an asymptotic result, it might not
yield the best choice for a given source realization of finite length. Hence, for the sake
of comparison, we have also considered a brute-force bit-allocation algorithm to find the
global optimum over all values of $n_{p,d}$ and $q_{p,d}$. The brute-force search is initialized by
letting $q_{p,d} = 0$ and $n_{p,d} = 1$ for all p, d, which yields equiprobable bits ($\nu(p, d, \ell, \kappa) = 0$
for all p, d, $\ell$, $\kappa$). Therefore, the initial value of the total output length is $B_{tot} = K(P + 1)$,
i.e., the length of the original redundant sequence $\mathbf{u}$. Then, we search over all $n_{p,d} =
1, 2, 3, \ldots$ and $q_{p,d} = 1, 2, \ldots$ for the global minimum of the total description length of
each (p, d) segment. The search over $q_{p,d}$ is stopped when increasing the model parameter
quantization bits by one does not correspond to a decrease of the total description length.
Even though this search might appear computationally heavy, it should be noticed that
in practice only a few values of $n_{p,d}$ and $q_{p,d}$ need to be considered. Also, since each
segment (p, d) is independently encoded, the global minimum of the output length is
found by independently minimizing the description length of each segment (p, d). Hence,
the brute-force bit-allocation is actually feasible in practice. In this case, the pair of
parameters $(n_{p,d}, q_{p,d})$ for each segment (p, d) must be included in the model description
and its length must be added to the total output length.

We ran some tests and comparisons based on the monochrome "Goldhill" 512 x 512
and the monochrome "Lena" 512 x 512 test images, after D = 2 stages of Daubechies
DWT [70], [9]. Quantization is on 512 levels (corresponding to P = 8). In Figs. 7 and 8
we show the values of the total output length as a function of $N_s$ when the parameter
quantization bits are set according to Rissanen's bound. The curves have several local
minima and maxima because of integer effects, since the segment lengths are generally
not multiples of $N_s$ and for some values of $N_s$ we have spurious groups of stripes that
cause the fluctuations. It is interesting to notice that the total output length is rather
smooth with respect to $N_s$ and stays close to its global minimum for a wide range of
values. Hence, optimization with respect to $N_s$ is not very critical in practice, provided
that $N_s$ is chosen reasonably. This suggests that only a few values of $N_s$ should be tried
in a practical implementation.

The values obtained using optimized values $n_{p,d}$, different for each segment (p, d), and
Rissanen's model bit allocation are reported as a horizontal line, denoted by "Opt.
Riss.". Finally, the result of the brute-force bit-allocation is reported as a horizontal
line, denoted by "Opt. No. Riss.". In both Fig. 7 and Fig. 8 it appears that optimizing
with respect to the stripe grouping values $n_{p,d}$ in each segment yields a significant advantage.
On the contrary, the parameter quantization bit allocation given by Rissanen's bound
is always very close to optimum. Since this allows a much faster optimization of the
model description, this method is to be preferred and it is used for the rest of the results
presented in this paper.




Figure 7: Total output length for various options of the model description for the Goldhill 
test image. 

Table 1 reports (first column) the total output length for "Goldhill" and "Lena"
achieved by the proposed probability model using arithmetic coding to compress the
data and the above optimized model description. For comparison, the JPEG2000 output
length (second column) is reported for the same quantization distortion $D_Q$. We
notice that there is a remarkable agreement of the overall output length. This shows that
the proposed Markov model is consistent with the probability model implicitly assumed
by JPEG2000, and that the sequential adaptive entropy coder of JPEG2000 produces
a redundancy very similar to the model redundancy obtained by our optimized model
description. The slight advantage of our method is believed to be due to the fact that
JPEG2000 resets the KT estimators on each code-block and inserts resynchronization
symbols to limit error propagation.

Figure 8: Total output length for various options of the model description for the Lena
test image.

Table 1: Comparison between the JPEG2000 output length (in bit per pixel) and the
output length of an ideal entropy encoder based on the proposed probability modeler, for
D = 2 stages of DWT, 8 bit-planes and optimized number of quantization bits for the
model parameters.

4 Encoding and decoding 

In this section we illustrate the linear joint source-channel encoding and obtain the Factor 
Graph (FG) of the joint probability distribution of the source sequence u and the channel 
output y. The FG yields directly a low-complexity iterative joint source-channel decoder 
based on BP, by a completely standard application of the Sum-Product computation rules 
[10]. Since the derivation of the BP algorithm is nothing more than an exercise, it will be
omitted for the sake of conciseness. 
4.1 Linear encoding by punctured Turbo Codes



In order to seek a good tradeoff between performance and complexity, at possibly large
but finite block length, we resort to the use of well-known families of linear binary
codes for which efficient BP decoding is easily implementable. In this paper we focus on
the use of TCs [H ES|l!l 

As said in Section 2.3, we successively encode the bit-planes as $\mathbf{x}^{(p)} = \mathbf{u}^{(p)} \mathbf{G}_p$, for all
p = 0, ..., P. We focus on the encoder of a generic p-th bit-plane and drop the index p
for notational simplicity. We consider a TC family with two identical component binary
Recursive Convolutional Codes (RCC) of rate 1. The RCC encoder is defined by the
input-output relation $x(D) = \frac{b(D)}{a(D)} u(D)$. As usual, the code generators (a(D), b(D)) are
expressed by their coefficients in octal notation. For example, the RCC with generators
$a(D) = 1 + D + D^2 + D^3 + D^4$ and $b(D) = 1 + D^4$ is indicated by $(37, 21)_8$. We use a
tail-biting encoder [711 EHl ES]. Hence, the mapping $(u_1, \ldots, u_K) \mapsto (x_1, \ldots, x_K)$ is
given by

$$\mathbf{x} = \mathbf{u} \mathbf{A}^{-1} \mathbf{B} \qquad (17)$$

where $\mathbf{A}$ is the K x K circulant matrix with first row

$$(a_0, a_1, \ldots, a_\mu, \underbrace{0, 0, \ldots, 0}_{K-\mu-1})$$

and $\mathbf{B}$ is the circulant matrix with first row

$$(b_0, b_1, \ldots, b_\mu, \underbrace{0, 0, \ldots, 0}_{K-\mu-1})$$

and $\mu$ denotes the RCC encoder memory.

The turbo encoder with puncturing is obtained as follows. Let $\Pi_1, \Pi_2$ denote two
K x K permutation matrices (interleavers), and $\mathbf{R}_0$, $\mathbf{R}_1$ and $\mathbf{R}_2$ denote three puncturing
matrices, of dimension $K \times n_0$, $K \times n_1$ and $K \times n_2$, respectively. Notice that a K x n
puncturing matrix is a submatrix of a K x K permutation matrix obtained by selecting
n out of K columns. Then, a generator matrix for the TC with given RCC component,
interleaver and puncturing is given by

$$\mathbf{G} = [\mathbf{R}_0 \mid \Pi_1 \mathbf{A}^{-1} \mathbf{B} \mathbf{R}_1 \mid \Pi_2 \mathbf{A}^{-1} \mathbf{B} \mathbf{R}_2] \qquad (18)$$



^LDPC codes [HI |63l EH El] and IRA codes [66IEI], as well as many variations thereof proposed in 
the recent literature could be used here, after obvious modifications. 

^D-transform notation: a sequence $\ldots, x_{-1}, x_0, x_1, x_2, \ldots$ is represented by the Laurent series
$x(D) = \sum_{\ell} x_\ell D^\ell$ in the indeterminate D.
The turbo-encoded output sequence is given by three blocks. The first block ("systematic"),
$\mathbf{u}\mathbf{R}_0$ of length $n_0$, corresponds to a punctured version of the encoder input.
The second and the third blocks ("parity"), $\mathbf{u}\Pi_1\mathbf{A}^{-1}\mathbf{B}\mathbf{R}_1$ and $\mathbf{u}\Pi_2\mathbf{A}^{-1}\mathbf{B}\mathbf{R}_2$ of length $n_1$
and $n_2$, respectively, are obtained by permuting the input sequence, passing it through
the (tail-biting) RCC encoder, and puncturing its output. The coding rate is given by
$K/(n_0 + n_1 + n_2)$.
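The construction of the punctured tail-biting turbo generator matrix in (17) and (18) can be sketched with small dense matrices over GF(2). This is only an illustration for short block lengths, not an efficient encoder; the helper names, the way permutations and puncturing patterns are passed (index lists), and the Gauss-Jordan inversion are our choices.

```python
import numpy as np

def circulant_gf2(first_row):
    """Circulant matrix over GF(2): row i is first_row cyclically shifted by i."""
    row = np.asarray(first_row, dtype=np.uint8)
    return np.stack([np.roll(row, i) for i in range(row.size)])

def inverse_gf2(mat):
    """Gauss-Jordan inversion over GF(2); raises ValueError if singular."""
    n = mat.shape[0]
    aug = np.concatenate([mat.astype(np.uint8) % 2,
                          np.eye(n, dtype=np.uint8)], axis=1)
    for col in range(n):
        rows = np.nonzero(aug[col:, col])[0]
        if rows.size == 0:
            raise ValueError("matrix is singular over GF(2)")
        piv = col + int(rows[0])
        if piv != col:
            aug[[col, piv]] = aug[[piv, col]]
        for r in range(n):
            if r != col and aug[r, col]:
                aug[r] ^= aug[col]
    return aug[:, n:]

def turbo_generator(a, b, K, perm1, perm2, keep0, keep1, keep2):
    """G = [R0 | Pi1 A^-1 B R1 | Pi2 A^-1 B R2] over GF(2).

    a, b: generator coefficients (a_0..a_mu), (b_0..b_mu);
    perm1, perm2: interleavers as index permutations of range(K);
    keep0, keep1, keep2: indices of the columns kept by the puncturing.
    """
    pad = lambda c: list(c) + [0] * (K - len(c))
    A = circulant_gf2(pad(a))
    B = circulant_gf2(pad(b)).astype(int)
    C = (inverse_gf2(A).astype(int) @ B) % 2      # tail-biting RCC map
    eye = np.eye(K, dtype=int)
    cols = lambda keep: np.asarray(keep, dtype=int)
    blocks = [eye[:, cols(keep0)],
              (eye[perm1] @ C)[:, cols(keep1)],
              (eye[perm2] @ C)[:, cols(keep2)]]
    return np.concatenate(blocks, axis=1) % 2
```

A codeword is then obtained as `(u @ G) % 2`, and the coding rate is `K / G.shape[1]`.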

Next, we discuss the optimization of the TC encoder parameters, namely, the RCC
generator polynomials, interleavers and puncturing matrices. Intuitively, $\mathbf{G}$ in (18) should
mimic as closely as possible the generator matrix of a random linear code. In particular, a
necessary (non-sufficient) condition for approaching the Shannon limit is that the encoder
output is marginally uniformly distributed [77], [35]. In fact, $\mathbf{G}$ should map the statistically
dependent and marginally non-uniform binary symbols of the input $\mathbf{u}$ into channel symbols
$\mathbf{x}$ with the required capacity-achieving uniform distribution. This problem is discussed
in [35] for the case of non tail-biting TCs and in the limit of infinite block length (a
"convolutional coding" framework). In particular, it is shown that the encoder output
$x_k$ is marginally uniform when k is large, for a non-uniform i.i.d. encoder input u(D), if
and only if b(D) is not a multiple of a(D). Also, it is shown that under mild conditions
on the generator polynomial a(D) the state at time k of the RCC encoders is uniformly
distributed over the encoder state space irrespectively of the input probability, for an i.i.d.
input sequence and large k.

Our problem differs from [35] in the fact that we consider a block coding framework
and tail-biting codes. Next, we shall show that by choosing a primitive polynomial a(D),
in the limit of large block length and large polynomial degree we obtain both a marginally
uniformly distributed encoder output and an encoder state uniformly distributed over the
state space for all positions k = 1, ..., K in the block. We have:

Lemma A. The mapping $f(D) \mapsto \mathbf{F}$, that maps polynomials $f(D) = f_0 + f_1 D +
\cdots + f_{K-1} D^{K-1}$ in the ring $\mathbb{F}_2[D]/(1 + D^K)$ of polynomial residues modulo $1 + D^K$ into the
K x K circulant matrix whose first row is $[f_0, f_1, \ldots, f_{K-1}]$, is a ring isomorphism.

Proof. Clearly, the zero polynomial is mapped into the zero matrix, the polynomial 1
is mapped into the identity matrix, the mapping is linear and preserves the product, i.e.,
f(D)h(D) modulo $1 + D^K$ (a cyclic convolution of the polynomial coefficients) is mapped
into the product $\mathbf{F}\mathbf{H}$ of the corresponding matrices. □

As a consequence, we have:

Lemma B. $\mathbf{A}$ defined above is non-singular if and only if a(D) is not a divisor of zero
in the ring $\mathbb{F}_2[D]/(1 + D^K)$. In particular, $\mathbf{A}$ is invertible if and only if a(D) and $1 + D^K$
are relatively prime. □

^This encoder structure cannot implement coding rates smaller than 1/3. If coding rates smaller than
1/3 are needed, we have to add RCC stages in parallel. However, this was not needed in our simulations.

Since $\mathbf{A}^{-1}$ is a circulant matrix and, as said before, we wish its rows to look
as random as possible, we shall choose a(D) to be a primitive polynomial of degree $\mu$.
The existence of $\mathbf{A}^{-1}$ is guaranteed by the following

Lemma C. For a(D) primitive, $\mathbf{A}$ is invertible if and only if $2^\mu - 1$ does not divide K.

Proof. By Lemma B, we need that a(D) and $1 + D^K$ are relatively prime. Since
a(D) is irreducible over $\mathbb{F}_2$, it follows that the condition holds if and only if a(D) does not
divide $1 + D^K$. The roots of a(D) are primitive elements of the field $\mathbb{F}_{2^\mu}$, the extension of
degree $\mu$ over $\mathbb{F}_2$. All the non-zero elements of $\mathbb{F}_{2^\mu}$ are roots of the polynomial $1 + D^{2^\mu - 1}$.
Hence, a(D) is a factor of $1 + D^{2^\mu - 1}$. Finally, $1 + D^{2^\mu - 1}$ does not divide $1 + D^K$ if and
only if $2^\mu - 1$ does not divide K. □
The condition that K is not a multiple of $2^\mu - 1$ is easily satisfied in practice, since
typically K is a power of two. By choosing a(D) primitive we have that the feedback shift
register with coefficients given by a(D) in the RCC encoder generates an m-sequence of
period $2^\mu - 1$, with Hamming weight $2^{\mu-1}$. This has the following nice consequence:

Lemma D. If $2^\mu - 1$ does not divide K, the circulant matrix $\mathbf{A}^{-1}$ has first row $\mathbf{r}$
formed by the concatenation of $\lfloor K/(2^\mu - 1) \rfloor$ periods of the m-sequence generated by a(D), plus
K modulo $2^\mu - 1$ extra symbols. In particular, for $K \gg 2^\mu - 1 \gg 1$ we have that the
normalized Hamming weight of $\mathbf{r}$ is close to 1/2.

Proof. Since $\mathbf{A}^{-1}$ exists, the first row $\mathbf{r}$ is obtained by loading the feedback shift
register with some non-zero configuration of the memory elements and feeding as input
the sequence u(D) = 1. This clearly generates $\lfloor K/(2^\mu - 1) \rfloor$ periods of the corresponding
m-sequence, plus some tail symbols to arrive at the total length K.
The Hamming weight of $\mathbf{r}$ is given by

$$2^{\mu-1} \left\lfloor \frac{K}{2^\mu - 1} \right\rfloor + \Delta$$

where $0 \le \Delta \le 2^\mu - 2$ is the Hamming weight of the tail symbols. For large $\mu$ and
$K \gg 2^\mu - 1$ the above Hamming weight normalized by the block length K is close to
$\frac{2^{\mu-1}}{2^\mu - 1} + \Delta/K \approx 1/2$. □
Lemma D implies that we can construct a structured generator matrix $\mathbf{G}$ as in (18)
having a marginal empirical distribution of its entries close to 1/2, i.e., close to that of
the capacity-achieving linear coding ensemble. For example, consider $a(D) = 1 + D^3 + D^4$
(or $(23)_8$) and K = 16. The corresponding first row of $\mathbf{A}^{-1}$ is equal to

r = [1,0,0,0,1,0,0,1,1,0,1,0,1,1,1,1] (19)

and has Hamming weight 9, so that 9/16 = 0.5625. If we consider for example length
K = 64, we would obtain $\mathbf{r}$ with Hamming weight 33, so that 33/64 = 0.5156, that is
already quite close to 1/2.
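The numerical example above, together with the invertibility condition of Lemma C, can be checked with a few lines of Python. The sketch builds the circulant matrix of a(D), with coefficients listed as $(a_0, \ldots, a_\mu)$, and inverts it over GF(2) by Gauss-Jordan elimination; the helper names are ours.

```python
import numpy as np

def circulant_gf2(first_row):
    """Circulant matrix over GF(2): row i is first_row cyclically shifted by i."""
    row = np.asarray(first_row, dtype=np.uint8)
    return np.stack([np.roll(row, i) for i in range(row.size)])

def inverse_gf2(mat):
    """Gauss-Jordan inversion over GF(2); raises ValueError if singular."""
    n = mat.shape[0]
    aug = np.concatenate([mat.astype(np.uint8) % 2,
                          np.eye(n, dtype=np.uint8)], axis=1)
    for col in range(n):
        rows = np.nonzero(aug[col:, col])[0]
        if rows.size == 0:
            raise ValueError("matrix is singular over GF(2)")
        piv = col + int(rows[0])
        if piv != col:
            aug[[col, piv]] = aug[[piv, col]]
        for r in range(n):
            if r != col and aug[r, col]:
                aug[r] ^= aug[col]
    return aug[:, n:]

def first_row_of_inverse(a_coeffs, K):
    """First row r of A^-1, with A the K x K circulant matrix of a(D)."""
    row = list(a_coeffs) + [0] * (K - len(a_coeffs))
    return inverse_gf2(circulant_gf2(row))[0]

a = [1, 0, 0, 1, 1]                      # a(D) = 1 + D^3 + D^4
r16 = first_row_of_inverse(a, 16)
r64 = first_row_of_inverse(a, 64)
print(r16.tolist(), int(r16.sum()), int(r64.sum()))
```

For K = 15, which is a multiple of $2^4 - 1$, the same call raises `ValueError`, in agreement with Lemma C.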

By choosing b(D) of degree $\le \mu$ and a(D) primitive, it follows that a(D) and b(D)
are relatively prime. From [33] we have that the encoder output $x_k$ is asymptotically
marginally uniformly distributed. However, since we consider a tail-biting code, a cyclic
shift of the RCC encoder output corresponds to a cyclic shift of the input and has the
same marginal distribution. It follows that, for large block length K, the output $x_k$ is
marginally uniformly distributed for all positions k = 1, ..., K.

Next, we turn our attention to the encoder state. The encoder state space coincides
with $\mathbb{F}_2^\mu$, which is an Abelian group (with respect to vector addition). In a tail-biting
convolutional code, the circulation state is defined as the initial encoder state $\mathbf{s}_0$ such that,
for a given input u(D), the final state is $\mathbf{s}_K = \mathbf{s}_0$, i.e., the path in the trellis corresponding
to the initial state $\mathbf{s}_0$ and input u(D) closes onto itself (tail-biting condition). The circulation
state $\mathbf{s}$ is a linear function of the input u(D). In particular, let $\mathbf{A}'$ denote the $K \times \mu$
matrix obtained by taking the last $\mu$ columns of $\mathbf{A}^{-1}$. Then, it is straightforward to see
that $\mathbf{s} = \mathbf{u}\mathbf{A}'$. For example, for the input u(D) = 1 the circulation state corresponding
to $a(D) = 1 + D^3 + D^4$ and K = 16 is given by $\mathbf{s} = (1, 1, 1, 1)$, i.e., the last 4 positions
of the sequence $\mathbf{r}$ in (19). Consider an input sequence u(D) with i.i.d. symbols, such
that $P(u_k = 1) = \rho$. As argued before, a good joint source-channel code should have an
encoder state distribution that is uniform over the state space irrespectively of the input
bias probability $\rho$. This is given by the following:

Lemma E. For $K \gg 2^\mu \gg 1$ and non-uniform i.i.d. encoder input u(D) with $P(u_k =
1) = \rho \in (0, 1)$,^10 the circulation state is almost uniformly distributed over the RCC
encoder state space. Furthermore, the encoder state at any position in the trellis is also
almost uniformly distributed.

Proof (sketch). First, we prove the following general simple result.
Fact: Let $\mathcal{A}$ denote a finite Abelian group of size q. Consider $g = \sum_{i=1}^{m} a_i$ where the
elements $a_i$ are independently selected according to the probability that puts mass $1/(q-1)$
on all non-zero elements of $\mathcal{A}$ and probability 0 on the zero element (additive identity
of the group). Then, the distribution of g converges to the uniform distribution on $\mathcal{A}$ for
large m.

Proof of the Fact: Consider m = 2, and without loss of generality denote the elements
of $\mathcal{A}$ by the integers 0, 1, ..., q-1, where 0 denotes the additive identity and the + rule
is given by the addition rule of the group. Then,

$$P(g = i) = \sum_{j=0}^{q-1} P(i - j) P(j), \qquad i = 0, \ldots, q-1 \qquad (20)$$

By defining the matrix

$$\mathbf{P} = \frac{1}{q-1} \left[ \mathbf{1}\mathbf{1}^T - \mathbf{I} \right]$$

(where $\mathbf{1}$ denotes the all-ones vector of length q) and the probability vectors $\mathbf{p}_0 =
(1, 0, \ldots, 0)^T$, $\mathbf{p}_1 = (0, 1/(q-1), \ldots, 1/(q-1))^T$ and $\mathbf{p}_2 = (P(g = 0), \ldots, P(g = q-1))^T$
we find that $\mathbf{p}_1 = \mathbf{P}\mathbf{p}_0$ and that (20) can be written as

$$\mathbf{p}_2 = \mathbf{P}\mathbf{p}_1 = \mathbf{P}^2\mathbf{p}_0$$

^10 If $\rho = 0$ or $\rho = 1$ the entropy of u(D) is equal to zero, and we shall not encode constant all-zero or
all-ones inputs. Hence, this restriction does not involve any loss of generality in our context.

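Extending this to the case of general m, we find that the distribution of $g = \sum_{i=1}^{m} a_i$ is
given by $\mathbf{p}_m = \mathbf{P}^m\mathbf{p}_0$. We can write

$$\mathbf{P}^m = \frac{1}{(q-1)^m} \left[ \frac{1}{q} \sum_{\ell=1}^{m} \binom{m}{\ell} q^{\ell} (-1)^{m-\ell}\, \mathbf{1}\mathbf{1}^T + (-1)^m \mathbf{I} \right] = \frac{1}{q}\mathbf{1}\mathbf{1}^T + \frac{(-1)^m}{(q-1)^m}\mathbf{I} - \frac{(-1)^m}{q(q-1)^m}\mathbf{1}\mathbf{1}^T \qquad (21)$$

From (21) it is evident that $\mathbf{P}^m \to \frac{1}{q}\mathbf{1}\mathbf{1}^T$ for $m \to \infty$. Then, we conclude that $\mathbf{p}_m =
\mathbf{P}^m\mathbf{p}_0 \to (1/q)\mathbf{1}$, as we wanted to show.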
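The convergence stated in the Fact is easy to check numerically. The sketch below iterates $\mathbf{p}_m = \mathbf{P}\mathbf{p}_{m-1}$ and compares the result with the closed form implied by the expansion of $\mathbf{P}^m$, namely $\mathbf{p}_m = \frac{1-\lambda}{q}\mathbf{1} + \lambda\,\mathbf{p}_0$ with $\lambda = (-1/(q-1))^m$. Function names are ours.

```python
import numpy as np

def sum_distribution(q, m):
    """Distribution of the sum of m i.i.d. group elements, each uniform
    over the q - 1 non-zero elements, obtained by iterating p <- P p."""
    P = (np.ones((q, q)) - np.eye(q)) / (q - 1)
    p = np.zeros(q)
    p[0] = 1.0                      # p_0: all mass on the identity
    for _ in range(m):
        p = P @ p
    return p

def closed_form(q, m):
    """Same distribution from the closed-form expression of P^m p_0."""
    lam = (-1.0 / (q - 1)) ** m
    p = np.full(q, (1.0 - lam) / q)
    p[0] += lam
    return p
```

For q = 16 (the state space of a memory-4 encoder) the deviation from the uniform distribution decays as $15^{-m}$, so a handful of non-zero terms already gives an essentially uniform sum.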

In order to show Lemma E, consider a typical realization of the input sequence u(D).
This has Hamming weight $m \approx \rho K$. Furthermore, these ones can be located in any position
of the input sequence with the same probability. The matrix $\mathbf{A}'$ defined before contains
$\approx K/(2^\mu - 1) \gg 1$ periodic repetitions of the $2^\mu - 1$ non-zero states plus a segment not
longer than $2^\mu - 2$ states. Each non-zero state appears essentially the same number of
times in the rows of $\mathbf{A}'$. It follows that a "one" symbol in u(D) uniformly distributed
over all positions 1, ..., K selects a non-zero state with uniform probability, equal to
$1/(2^\mu - 1)$. A typical input selects $m \approx \rho K$ non-zero elements of the state space with
uniform probability. Since the circulation state $\mathbf{s} = \mathbf{u}\mathbf{A}'$ is given by the sum of these m
random non-zero elements, we apply the above Fact and conclude that, for large K, $\mathbf{s}$ is
uniformly distributed over the whole state space $\mathbb{F}_2^\mu$ (including the zero state).

Furthermore, the RCC encoder state when the encoder is driven by an i.i.d. sequence
u(D) defines a Markov chain. Since a(D) is primitive, it is easy to see that the Markov
chain is indecomposable and aperiodic. By construction, the transition matrix of the
Markov chain corresponding to the RCC state has exactly two non-zero elements in each
row and column, equal to $\rho$ and $1 - \rho$. It follows that the (unique) stationary distribution
is given by the uniform distribution. Since the circulation state is (almost) uniformly
distributed, any state at any given section of the trellis is also (almost) uniformly distributed,
and it is exactly uniformly distributed as $K \to \infty$. □
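Lemma E can be illustrated exactly for small parameters. The sketch below forms $\mathbf{A}'$ (the last $\mu$ columns of $\mathbf{A}^{-1}$) and computes the exact distribution of the circulation state $\mathbf{s} = \mathbf{u}\mathbf{A}'$ for an i.i.d. input with $P(u_k = 1) = \rho$, by accumulating one input symbol at a time. The helper names and the integer labelling of the states are ours.

```python
import numpy as np

def circulant_gf2(first_row):
    """Circulant matrix over GF(2): row i is first_row cyclically shifted by i."""
    row = np.asarray(first_row, dtype=np.uint8)
    return np.stack([np.roll(row, i) for i in range(row.size)])

def inverse_gf2(mat):
    """Gauss-Jordan inversion over GF(2); raises ValueError if singular."""
    n = mat.shape[0]
    aug = np.concatenate([mat.astype(np.uint8) % 2,
                          np.eye(n, dtype=np.uint8)], axis=1)
    for col in range(n):
        rows = np.nonzero(aug[col:, col])[0]
        if rows.size == 0:
            raise ValueError("matrix is singular over GF(2)")
        piv = col + int(rows[0])
        if piv != col:
            aug[[col, piv]] = aug[[piv, col]]
        for r in range(n):
            if r != col and aug[r, col]:
                aug[r] ^= aug[col]
    return aug[:, n:]

def circulation_state_matrix(a_coeffs, K):
    """A': the last mu columns of A^-1, so that s = u A' (mod 2)."""
    mu = len(a_coeffs) - 1
    row = list(a_coeffs) + [0] * (K - len(a_coeffs))
    return inverse_gf2(circulant_gf2(row))[:, K - mu:]

def circulation_state_distribution(a_coeffs, K, rho):
    """Exact distribution of s = u A' for i.i.d. u_k with P(u_k = 1) = rho."""
    a_prime = circulation_state_matrix(a_coeffs, K).astype(int)
    mu = a_prime.shape[1]
    labels = a_prime @ (1 << np.arange(mu))   # each row as an integer label
    states = np.arange(2 ** mu)
    dist = np.zeros(2 ** mu)
    dist[0] = 1.0
    for lab in labels:
        # u_k = 0 keeps the partial sum, u_k = 1 adds (XORs) row k of A'.
        dist = (1.0 - rho) * dist + rho * dist[states ^ lab]
    return dist
```

With $a(D) = 1 + D^3 + D^4$, K = 64 and $\rho = 0.1$, the resulting distribution over the 16 states is already very close to uniform, even though the input is strongly biased.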

Notice that the input sequence to the RCC encoder in the tail-biting turbo encoder
defined before is an interleaved version of the bit-plane bit sequence. Hence, even if the
latter is not i.i.d. (e.g., in our case the bit-plane is a non-stationary Markov chain), after
interleaving it will be close to i.i.d., with a probability $\rho$ equal to the fraction of "ones"
in the bit-plane. Notice also that, although we interleave the bit-plane sequence for the
purpose of turbo encoding and BP decoding, the bit-plane entropy rate is computed
taking into account the Markov memory structure, as said in Section 3. This has nothing
to do with the entropy rate of the bit-plane after interleaving, given by $H_2(\rho)$ where $H_2(\cdot)$
denotes the binary entropy function. By the concavity of entropy the latter may be much
larger than the actual bit-plane entropy rate.

For the choice of b(D) we follow the theory developed in [35]. In particular, letting
b(D) = 1 yields that the RCC encoder $\mu$-th order empirical distribution, as defined in [77],
is uniformly distributed over $\mathbb{F}_2^\mu$. This follows immediately from the fact that in this case
the encoder output x(D) coincides with the state bits and by Lemma E we have that the
empirical distribution of any block $(x_{k-\mu}, \ldots, x_{k-1})$ of state bits is uniformly distributed over
$\mathbb{F}_2^\mu$. However, b(D) = 1 does not necessarily yield the best coding performance. Hence,
we have searched semi-exhaustively for generator polynomials b(D) with $b_0 = b_\mu = 1$.

For given RCC generators, the permutations $\Pi_1$ and $\Pi_2$ are chosen at random, by
trial and error. For $K$ not too small, the effect of optimizing the permutations on the
end-to-end distortion of the proposed scheme is minimal. In fact, one significant advantage
of the proposed JSCC scheme is that its performance is not dominated by the "error
floor" region of the BER: as soon as the BER drops to small values (waterfall region),
the reconstruction quality of our JSCC scheme rapidly improves and attains satisfactory
distortion levels, even though the residual BER is not as small as would be required by
(lossless) data applications. In this respect, the proposed JSCC scheme puts much less
stress on the code design than a conventional SSCC scheme.

More care must be devoted to the optimization of the puncturing matrices. When the
bits in a given bit-plane are not i.i.d. uniformly distributed, sending the systematic part
over the BIOS channel results in a suboptimal code. Hence, unless $H(U_p | U_{p+1}, \ldots, U_P)$
is very close to 1 bit/symbol, we let $n_0 = 0$. For $R_1$ and $R_2$, we have constructed (off-line)
a library of puncturing matrices by following the incremental redundancy approach
advocated in [TSl dSl [23], [13]. We stress the fact that the codes are constructed off-line,
and do not depend on the actual source sequence to be encoded. By using a greedy
pseudorandom progressive growth search (i.e., adding columns to the puncturing matrices),
we have designed a library of nested puncturing matrices in order to cover all the rates
$C/H - \delta$, for all quantized values $H$ in a predetermined fine grid in $[0, 1]$. The library was
designed assuming a Bernoulli source with entropy $H$ for all the quantized values of $H$.
Further discussion on the design of the puncturing matrices is provided in [73] (see also 
[Hi). 
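
The idea of a nested library indexed by a quantized entropy can be sketched as follows. This is a simplified illustration and not the construction of the paper: the greedy progressive-growth search is replaced here by a fixed pseudorandom ordering of the parity positions, and the function names, the grid step and the gap `delta` are all hypothetical. Nesting is obtained by always transmitting a prefix of the same ordering, so a lower-rate pattern contains every higher-rate one.

```python
import math
import random

def coded_length(K, H, C, delta, grid=0.01):
    """Channel uses for K source bits at target efficiency C/Hq - delta.

    Hq is the entropy H rounded up to the next point of a grid of step `grid`.
    """
    Hq = min(1.0, max(grid, math.ceil(H / grid - 1e-9) * grid))
    eta = C / Hq - delta
    if eta <= 0:
        raise ValueError("target efficiency must be positive")
    return math.ceil(K / eta)

def nested_puncturing(num_parity, lengths, seed=0):
    """Map each coded length n to the sorted parity positions that are sent.

    All patterns are prefixes of one pseudorandom ordering, hence nested.
    """
    order = list(range(num_parity))
    random.Random(seed).shuffle(order)
    return {n: sorted(order[:n]) for n in sorted(lengths)}

K = 1024
lengths = [coded_length(K, H, C=0.5, delta=0.02) for H in (0.1, 0.2, 0.3)]
patterns = nested_puncturing(num_parity=2 * K, lengths=lengths)
for n in sorted(patterns):
    print(n, "parity bits sent, first positions:", patterns[n][:5])
```

Since the library depends only on the grid of entropies and on the channel capacity, it can be built once off-line, as stressed in the text.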

Fig. [9] shows the threshold effect for the case of a binary Bernoulli source of length
$K = 512 \times 512$, with entropy rate $H(U) = 0.5$, and a Binary Symmetric Channel (BSC)
with capacity $C = 0.5$. Notice that each point of Fig. [9] corresponds to a different
puncturing matrix, generated by incremental redundancy as explained above. The BER
is averaged over a random choice of the interleavers. The transition of the family of
progressively punctured TCs generated in this way is remarkably close to the Shannon
limit $\eta = C/H(U) = 1$. Again, we stress the fact that our scheme does not need very
small BER in order to provide good reproduction results. BERs of the order to 10"^ are 
sufficient, as we will see in the examples of Section 5.




Figure 9: Threshold effect for the case of a Bernoulli source with $H(U) = 0.5$, transmitted over
a BSC with $C = 0.5$, for a family of progressively punctured TCs.
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4.2 Belief Propagation decoding 

The decoder is based on the successive decoding structure of Fig. [10], where the bit-planes 
are decoded in sequence, and the decoder at level p makes use of the hard decisions 
made at the upper levels. These hard decisions determine the value of the conditioning 
bits needed for the conditional Markov model at each level. An iterative version of the 
multistage decoder where soft information is exchanged between the bit-plane decoders 
was also considered and simulated, but we found that it does not provide any substantial 
improvements and therefore the additional (significant) complexity is not justified. 
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Figure 10: Structure of the successive bit-plane joint source-channel decoder. 

It is assumed that the decoder knows the probability model $P_\theta^{(K)}(\mathbf{u})$ defined in Section
3, with the LLRs given in (15) as model parameters. As already said, the model parameters
have to be sent separately, and need to be decoded with very high reliability. Fortunately,
the model redundancy is very small in comparison with the data length and therefore we
can afford to use a low-rate code to protect the model parameters against channel errors.
This is just a standard channel coding problem that we do not address here. However,
we would like to stress that the simulation results and the comparison with conventional
SSCC are made by including the transmission of the model parameters, both in terms of
redundancy and in terms of errors, i.e., the comparison is fair and was made without any
optimistic assumption of perfect reception of the side information $\theta$.

Let us focus on the $p$-th component decoder and let $\mathbf{y}^{(p)}$ denote the BIOS channel output
corresponding to the transmission of the coded $p$-th bit-plane $\mathbf{x}^{(p)} = \mathbf{u}^{(p)} \mathbf{G}_p$. The goal of
the $p$-th decoder is to compute the symbol-by-symbol posterior marginal probability

$$P_\theta(u_{p,k} | \mathbf{y}^{(p)}, \hat{\mathbf{u}}^{(p+1)}, \ldots, \hat{\mathbf{u}}^{(P)}) \qquad (22)$$

where $\hat{\mathbf{u}}^{(p+1)}, \ldots, \hat{\mathbf{u}}^{(P)}$ denote the hard decisions on the upper bit-planes made at
previous decoding stages. This can be accomplished (approximately) by the BP algorithm
applied to the FG of the underlying joint probability distribution. In the case of Markovian
sources and TCs, the BP takes on the particularly appealing form of three BCJR
forward-backward algorithms [78] exchanging soft information in the form of "extrinsic"
log-likelihood ratios (LLRs) [10]. This follows by standard application of the Sum-Product
computation rules [10] to the resulting FG. Hence, this section is devoted to illuminating
the FG structure.

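Recall the "marginalization" notation of [10], for which $\sum_{\sim v}$ denotes the sum over all
variables in the summation argument while keeping $v$ fixed. Furthermore, a probability
distribution $P(v)$ needs to be determined only up to a proportionality constant (denoted
by $\propto$), which can be obtained by imposing the normalization $\sum_v P(v) = 1$. Using Bayes'
rule and neglecting irrelevant proportionality terms, we obtain the factorization

$$
\begin{aligned}
P_\theta(u_{p,k} | \mathbf{y}^{(p)}, \hat{\mathbf{u}}^{(p+1)}, \ldots, \hat{\mathbf{u}}^{(P)})
 &= \sum_{\sim u_{p,k}} P_\theta(\mathbf{u}^{(p)} | \mathbf{y}^{(p)}, \hat{\mathbf{u}}^{(p+1)}, \ldots, \hat{\mathbf{u}}^{(P)}) \\
 &\propto \sum_{\sim u_{p,k}} P(\mathbf{y}^{(p)}, \mathbf{u}^{(p)} | \hat{\mathbf{u}}^{(p+1)}, \ldots, \hat{\mathbf{u}}^{(P)}, \theta) \\
 &= \sum_{\sim u_{p,k}} \sum_{\mathbf{x}^{(p)}} P(\mathbf{y}^{(p)} | \mathbf{x}^{(p)}) \, P(\mathbf{x}^{(p)} | \mathbf{u}^{(p)}) \, P_\theta(\mathbf{u}^{(p)} | \hat{\mathbf{u}}^{(p+1)}, \ldots, \hat{\mathbf{u}}^{(P)}) \\
 &= \sum_{\sim u_{p,k}} \sum_{\mathbf{x}^{(p)}} \prod_{i} P(y_{p,i} | x_{p,i}) \cdot P_\theta(\mathbf{u}^{(p)} | \hat{\mathbf{u}}^{(p+1)}, \ldots, \hat{\mathbf{u}}^{(P)}) \cdot 1\{\mathbf{x}^{(p)} = \mathbf{u}^{(p)} \mathbf{G}_p\}
\end{aligned}
\qquad (23)
$$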



where we have used the fact that the BIOS channel is memoryless and where $P(\mathbf{x}^{(p)} | \mathbf{u}^{(p)}) =
1\{\mathbf{x}^{(p)} = \mathbf{u}^{(p)} \mathbf{G}_p\}$ (an indicator function) since $\mathbf{x}^{(p)}$ is a deterministic function of $\mathbf{u}^{(p)}$.

Eventually, the desired symbol-by-symbol posterior probabilities are obtained by marginalizing
the joint distribution in the last line of (23). From (13) and (14), the source probability
term factors as

$$P_\theta(\mathbf{u}^{(p)} | \hat{\mathbf{u}}^{(p+1)}, \ldots, \hat{\mathbf{u}}^{(P)}) = P(\pi_{p,1}) \prod_{k=1}^{K} P_\theta(\pi_{p,k+1} | \pi_{p,k}, \{u_{p',k'} : (p', k') \in \mathcal{S}_{p,k}\}) \qquad (24)$$

where the a priori probability $P(\pi_{p,1})$ is equal to 1 if $\pi_{p,1}$ is the initial state of the
chain and to 0 otherwise, and where the state transition probability is equal to
$P_\theta(u_{p,k} = u | \kappa)$ if $\pi_{p,k+1}$ is the state reached from $\pi_{p,k}$ when the bit $u$ enters the
state register, and to 0 otherwise.

The corresponding FG takes on the form of a trellis [10], reflecting the state transition
diagram of the Markov probability model.
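
To illustrate how a forward-backward (BCJR-type) recursion on such a trellis produces symbol-by-symbol posteriors, the following sketch treats a deliberately reduced setting: a first-order binary Markov source (the state is the previous bit) observed directly through a BSC, with no turbo code and no bit-plane contexts. All parameter values are hypothetical, and the recursion is left unnormalized, which is adequate only for short blocks. A brute-force enumeration is included as a cross-check.

```python
from itertools import product

def forward_backward(y, prior, trans, q):
    """Posteriors P(u_k = 1 | y) for a binary Markov chain seen through a BSC(q).

    prior[u] = P(u_1 = u), trans[v][u] = P(u_{k+1} = u | u_k = v).
    """
    K = len(y)
    like = [[(1.0 - q) if y[k] == u else q for u in (0, 1)] for k in range(K)]
    alpha = [[0.0, 0.0] for _ in range(K)]   # alpha[k][u] = P(y_1..y_k, u_k = u)
    beta = [[1.0, 1.0] for _ in range(K)]    # beta[k][u]  = P(y_{k+1}..y_K | u_k = u)
    for u in (0, 1):
        alpha[0][u] = prior[u] * like[0][u]
    for k in range(1, K):
        for u in (0, 1):
            alpha[k][u] = like[k][u] * sum(alpha[k - 1][v] * trans[v][u] for v in (0, 1))
    for k in range(K - 2, -1, -1):
        for u in (0, 1):
            beta[k][u] = sum(trans[u][v] * like[k + 1][v] * beta[k + 1][v] for v in (0, 1))
    post = []
    for k in range(K):
        w0, w1 = alpha[k][0] * beta[k][0], alpha[k][1] * beta[k][1]
        post.append(w1 / (w0 + w1))
    return post

def brute_force_posteriors(y, prior, trans, q):
    """Same posteriors by enumerating all 2^K source sequences."""
    K = len(y)
    num, total = [0.0] * K, 0.0
    for u in product((0, 1), repeat=K):
        w = prior[u[0]]
        for k in range(1, K):
            w *= trans[u[k - 1]][u[k]]
        for k in range(K):
            w *= (1.0 - q) if y[k] == u[k] else q
        total += w
        for k in range(K):
            if u[k]:
                num[k] += w
    return [n / total for n in num]

y = [0, 1, 1, 0, 1, 0, 0, 0]
prior = [0.9, 0.1]
trans = [[0.95, 0.05], [0.3, 0.7]]
fb = forward_backward(y, prior, trans, q=0.1)
bf = brute_force_posteriors(y, prior, trans, q=0.1)
print([round(p, 4) for p in fb])
```

In the decoder described here, the channel likelihoods would be replaced by the extrinsic messages coming from the two RCC trellises, and the transition probabilities by the context-dependent model.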

The other non-trivial component of the overall FG is given by the factorization of the
code indicator function $1\{\mathbf{x}^{(p)} = \mathbf{u}^{(p)} \mathbf{G}_p\}$. It is well-known that for TCs this takes the
form of two trellises, one for each RCC component, interconnected by interleavers.

The factor graph (FG) corresponding to the factorization in the last line of (23) is
given in Fig. [11]. We use Wiberg's notation (see [10]), for which the FG is a bipartite
graph with variable nodes (circles) and function nodes (boxes). State nodes are denoted
by filled circles. A variable node is connected to a function node if the corresponding
variable is an argument of the corresponding factor [10]. In our case, the variable nodes
correspond to the bit-plane bits $u_{p,k}$, to the Markov source states $\pi_{p,k}$ and to the RCC
encoder states. The function nodes correspond to the state transition probabilities and
to the BIOS channel transition probabilities. The channel output symbols $y_{p,i}$ and the
conditioning bits from the upper bit-planes appear as "dongles". The channel outputs
corresponding to punctured coded symbols are represented as crossed dongles. These
symbols are treated as erasures by the decoder.
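
Treating punctured symbols as erasures amounts to feeding the decoder a zero log-likelihood ratio at those positions. The sketch below shows this for a BSC; the function name and the sign convention (positive LLR favours a transmitted 0) are assumptions made for illustration.

```python
from math import log

def bsc_llrs(y, q, punctured=()):
    """Channel LLRs log P(y_i | x_i = 0) / P(y_i | x_i = 1) for a BSC(q).

    Positions listed in `punctured` were never transmitted, so they carry
    no channel information and get LLR 0 (an erasure).
    """
    magnitude = log((1.0 - q) / q)
    skipped = set(punctured)
    return [0.0 if i in skipped else (magnitude if bit == 0 else -magnitude)
            for i, bit in enumerate(y)]

llrs = bsc_llrs([0, 1, 1, 0], q=0.05, punctured=[2])
print(llrs)
```

With this convention a punctured position influences the BCJR recursions only through the code constraints and the source model.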

Standard application of Sum-Product computation rules to this FG yields the iterative
BP decoder, where one BCJR algorithm for each trellis subgraph in the FG is used.
The three BCJR component decoders exchange messages in the form of LLRs [10] via
the degree-3 variable nodes $u_{p,1}, \ldots, u_{p,K}$. In our simulations, we used a BP scheduling
where the BCJRs are activated in round-robin order (the order is basically irrelevant). It should
be noticed that symbol-by-symbol hard decisions are made for each stage $p$ and are used
in subsequent stages as a priori information. However, in the soft-reconstruction scheme,
the symbol-by-symbol a posteriori probabilities for each bit-plane are used in order to
compute the reconstructed source with the soft-bit strategy given at the end of Section 2.3.

^^The update of the shift register in the block diagram of Fig. [5] is based on a right shift
by one position of the state register.

Figure 11: Factor graph underlying the p-th component BP decoder.

5 Results 



In this section we present some results in terms of Peak Signal-to-Noise Ratio (PSNR)
versus efficiency $\eta$, measured as the number of source samples (pixels) per channel use.
We considered two 8-bit gray-scale test images of size $512 \times 512$ pixels, referred to as
"Goldhill" and "Lena". We ran simulations over a Binary Symmetric Channel (BSC)
and we designed our schemes for a BSC cross-over probability $p = 0.05$ (corresponding to
$C = 1 - H_2(p) = 0.7136$ bits per channel use).
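
The quoted capacity follows directly from the binary entropy function. A minimal check, with illustrative function names:

```python
from math import log2

def h2(p):
    """Binary entropy function in bits."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * log2(p) - (1.0 - p) * log2(1.0 - p)

def bsc_capacity(p):
    """Capacity of a BSC with cross-over probability p, in bits per channel use."""
    return 1.0 - h2(p)

print(round(bsc_capacity(0.05), 4))  # prints 0.7136
```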

For comparison, we have considered an SSCC scheme obtained by concatenating the
output of the JPEG2000 encoder with a standard punctured TC with rate optimized in
order to work as close as possible to the channel capacity. The simulations were performed
by using the OpenJPEG library, an open-source JPEG2000 codec written in C
[79]. We consider the case in which the error resilience tools provided by the JPEG2000
standard (i.e., SOP marker, SEGMARK marker and ERTERM strategy [9]) are enabled.
In order to compute the performance of the JPEG2000 compressed images, we transmit
and protect separately the header and the markers: they are received error-free by the
decoder.

As for the SSCC, several generator polynomials have been tested. Here we report
only the case of $(37, 21)_8$ [Hj, which yields the best results. Since the output of JPEG2000
can be considered as i.i.d. uniform bits, the TCs are conventional and we used standard
regular puncturing patterns available in the literature.

^^PSNR $= 10 \log_{10}(A^2/\mathrm{MSE})$ is a measure of the quality of image reconstruction, where MSE $= \frac{1}{K}\mathbb{E}[|\mathbf{s} - \hat{\mathbf{s}}|^2]$ is
the mean squared distortion of the reconstructed source and $A$ is the peak-to-peak signal amplitude. In
our case, $A = 255$, since we are working with 8-bit images. We fix the source realization to be one of the
test images considered and we take the expectation $\mathbb{E}[\cdot]$ with respect to the channel noise.

We would like to stress the fact that these results are presented here for the purpose
of a proof of concept: no particular effort has been made to optimize the codes beyond
the simple incremental redundancy greedy selection of the puncturing patterns described
before. We hasten to say that the results could probably be improved by better code
design. Furthermore, as mentioned in the literature review of Section 1, a whole range of
schemes between the basic SSCC scheme considered here and a fully JSCC are available, 
based on tuning the channel coding redundancy for each JPEG2000 encoded block [HI 
HSl Hni H?! HH, HHl [50] . We did not consider comparisons with these schemes since they 
are considerably more complicated than the proposed JSCC, and they are actually not 
yet used in today's applications. Given the number and the variety of schemes proposed 
in the literature, a full comparison is well beyond the scope of this paper. 

Tables [2] and [3] summarize the main characteristics of the JSCC schemes for the two
test images considered, respectively. The number of encoded bits in each bit-plane is
indicated by $K_p$. This may be less than $512^2 = 262144$ since some segments, especially
for the top bit-planes, might be identically zero. The third column shows the empirical
entropy rate $H_p = H(U_p | U_{p+1}, \ldots, U_P)$ as defined in Qj. The fourth column shows the
bound on coding efficiency for each bit-plane, given by $\eta_p = C/H_p$. We used generators
$(23, 35)_8$ for all bit-planes.
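
The per-bit-plane bounds $\eta_p = C/H_p$ can be recomputed from the entropies listed in Table 2. The sketch below does this for the "Goldhill" values with $C = 1 - H_2(0.05)$; the variable names are illustrative. The recomputed values agree with the tabulated ones to about three decimal places, the small residual presumably being due to the rounding of the printed entropies.

```python
from math import log2

# Capacity of the design BSC (cross-over probability 0.05).
C = 1.0 + 0.05 * log2(0.05) + 0.95 * log2(0.95)

# Bit-plane label -> (empirical entropy H_p, efficiency reported in Table 2).
goldhill = {
    "8": (0.2966, 2.4063), "7": (0.3158, 2.2595), "6": (0.2468, 2.8906),
    "5": (0.2341, 3.0474), "4": (0.3124, 2.2837), "3": (0.5431, 1.3138),
    "2": (0.8172, 0.8731), "1": (0.9586, 0.7444), "Sign": (0.6138, 1.1626),
}

recomputed = {plane: C / H for plane, (H, _) in goldhill.items()}
for plane, (H, eta_table) in goldhill.items():
    print(f"{plane:>4}  H_p = {H:.4f}  C/H_p = {recomputed[plane]:.4f}  table = {eta_table:.4f}")
```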



Table 2: Goldhill

Bit-plane p   length K_p   empirical entropy H_p   efficiency η_p
8             16384        0.2966                  2.4063
7             29184        0.3158                  2.2595
6             71680        0.2468                  2.8906
5             142336       0.2341                  3.0474
4             223232       0.3124                  2.2837
3             253440       0.5431                  1.3138
2             262144       0.8172                  0.8731
1             262144       0.9586                  0.7444
(Sign)        262144       0.6138                  1.1626


In order to deliver the model parameters to the decoder with high reliability we have
separately encoded them using a conventional TC (generators $(37, 21)_8$, rate 1/3), which
achieves practically error-free performance for the BSC with capacity 0.7136. For
reconstruction we have used the soft-bit decoding strategy given at the end of Section 2.3.

Table 3: Lena

Bit-plane p   length K_p   empirical entropy H_p   efficiency η_p
8             16384        0.345                   2.0691
7             29184        0.258                   2.7598
6             71680        0.277                   2.5764
5             142336       0.191                   3.7315
4             223232       0.190                   3.7527
3             253440       0.324                   2.1972
2             262144       0.672                   1.0607
1             262144       0.925                   0.7713
(Sign)        262144       0.4977                  1.4337

Figs. [12] and [13] show the PSNR performance of the proposed JSCC scheme and the
conventional SSCC. The resulting PSNR is plotted versus the coding efficiency, in order
to highlight the gap of the actual coding scheme with respect to the Shannon
limit defined as in Section 2.2. Notice that the coding efficiency decreases while moving
towards the right of the horizontal axis. As the coding efficiency decreases, the PSNR
reaches its maximum value, which corresponds to the quantization distortion $D_Q$, equal to
$\mathrm{PSNR}_{\mathrm{Goldhill}} = 49.57$ dB and $\mathrm{PSNR}_{\mathrm{Lena}} = 49.08$ dB for "Goldhill" and "Lena", respectively.
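
For reference, the PSNR figure of merit used throughout this section can be computed as sketched below, assuming the usual definition $10 \log_{10}(A^2/\mathrm{MSE})$ with peak amplitude $A = 255$. The helper names are illustrative, and the sketch evaluates a single reconstruction, whereas the reported values are averaged over the channel noise.

```python
from math import log10

def psnr(original, reconstructed, peak=255):
    """PSNR in dB between two equal-length sequences of pixel values."""
    mse = sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)
    return float("inf") if mse == 0 else 10.0 * log10(peak ** 2 / mse)

def mse_from_psnr(psnr_db, peak=255):
    """Mean squared error corresponding to a given PSNR in dB."""
    return peak ** 2 / 10.0 ** (psnr_db / 10.0)

print(psnr([0, 0, 0, 0], [1, 1, 1, 1]))   # unit MSE, about 48.13 dB
print(mse_from_psnr(49.57))               # MSE at the "Goldhill" quantization limit
```

Under this definition, the quantization-limited PSNR of 49.57 dB corresponds to a mean squared error slightly below one gray level squared.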




Figure 12: PSNR comparison between SSCC schemes and JSCC schemes for "Goldhill".

Figure 13: PSNR comparison between SSCC schemes and JSCC schemes for "Lena".

PSNR curves may not tell much about the actual quality of the reconstructed image.
Hence, we also include snapshots of the reconstructed images for both the JSCC and the
SSCC schemes corresponding to the efficiencies marked on Figs. [12] and [13] as "a, b, c, d, e, f".
The corresponding values $\eta_a, \eta_b, \ldots, \eta_f$ have increasing gap from the Shannon limit.

These snapshots, shown in Figs. [14] and [15], illustrate the main claims of this paper
and provide experimental evidence about the effectiveness of the proposed JSCC scheme:
it is clearly visible that as soon as $\eta$ is slightly above the Shannon limit, our JSCC scheme
achieves an acceptable reconstruction quality. On the contrary, the conventional SSCC 
achieves very poor quality for a much wider range of the gap from Shannon limit, and 
suddenly achieves the quantization distortion when the channel code is able to eliminate 
completely the channel errors (with very high probability). This behavior reflects the 
catastrophicity of the entropy coding stage of the conventional scheme, which is essentially 
eliminated by our approach. Moreover, in the SSCC the reconstructed images clearly 
show the square patterns due to the catastrophic error propagation inside code-blocks, 
because of residual bit errors after channel decoding. Even when the PSNR is not so 
low, these patterns reduce significantly the perceived image quality. On the other hand, 
the reconstructed images generated by the proposed JSCC method present a "salt and 
pepper" noise, that is less annoying for the perceived image quality. This nature of the 
generated noise opens the door to the study of the concatenation of our approach with 
the recently proposed discrete universal denoising algorithm (DUDE) [80], that can take 
advantage of the apparently independent memoryless residual noise after JSCC decoding 
in order to further improve the image visual quality. This interesting approach is not 
pursued in this paper and it is mentioned in Section [6] among the directions for future 
work. 

Figure 14: Visual comparison of the JSCC (first set) and the SSCC (second set) schemes for the
Goldhill image, for decreasing values of $\eta_a, \ldots, \eta_f$ indicated in Fig. [12] (from left to right).

Figure 15: Visual comparison of the JSCC (first set) and the SSCC (second set) schemes for the
Lena image, for decreasing values of $\eta_a, \ldots, \eta_f$ indicated in Fig. [13] (from left to right).

6 Conclusion

We presented a new scheme for pragmatic Joint Source-Channel Coding that builds on
two key information-theoretic ideas: a) the optimal rate-distortion performance can be
universally approached by low-dimensional quantization followed by entropy coding; b)
the optimal performance of (almost) lossless joint source-channel coding for a general
discrete source and a wide class of channels (BIOS channels in our case) can be attained
by linear coding. Our method is general in the sense that it allows one to take advantage of the
vast experience developed in lossy compression of practically relevant sources. In fact, the
scheme keeps the front-end of state-of-the-art source transform coders, comprising a linear
projection on a convenient orthonormal basis (such as DWT, short-time DFT, DCT) and
scalar quantization, unchanged. Also, the probability models developed in state-of-the-art
source coders can be reused, provided that these models can be represented in some easily
factored form, for the purpose of the joint source-channel Belief Propagation decoder. We
developed an example based on JPEG2000 and Turbo Codes. In our case, the probability
model reduces to a binary Markov chain for each bit-plane, conditionally on the bits
from upper bit-planes. The Markov chains are generally non-stationary. Nevertheless, we
showed that there exist binary linear codes that operate arbitrarily close to the Shannon
limit for any arbitrary source: a single codebook can handle all sources with given
sup-entropy. This suggests a scheme that selects the coding rate for each bit-plane by
measuring the empirical entropy of the bit-plane itself, based on the probability model
used by the decoder, and transmitted to the decoder separately.

The proposed scheme was simulated for some classical test images over a BSC, and
compared with a standard separated approach based on JPEG2000 and state-of-the-art
Turbo Codes. We achieve much better performance in terms of PSNR and, most importantly,
in terms of subjective image reconstruction quality. Remarkably, our scheme puts
very little stress on the channel coding scheme, and does not need to attain very low
BER in order to achieve good end-to-end distortion. We interpret this fact by noticing
that, by using linear codes, we eliminate the catastrophic behavior of traditional entropy
coding based on Huffman or arithmetic coding, by producing a mapping of the redundant
quantization bits onto the channel symbols that has a well-conditioned "soft-inverse".
Furthermore, the use of BP decoding makes it possible to exploit the decoder soft-output in order
to produce (approximate) MMSE estimates of the transform coefficients using soft-bits.

We would like to conclude by enumerating some topics for further research: 

• Several families of linear codes can be used instead of punctured TCs. In particular, 
instead of using a library of codes to cover a quantized range of rates, we can use 
fountain codes [HIl EO] in order to produce the required amount of redundancy "on 
the fly", depending on the empirical entropy of the source and, in the case of a 
compound of BIOS channels whose capacity is known to the transmitter, depending 
on the capacity of the channel. The use of fountain codes for universal lossless 
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compression was investigated in [23], and a similar encoding scheme can be used in 
this context. Furthermore, the same approach can be used in multicast applications, 
where the same source must be sent to several users and each user may have a 
different channel capacity, as in the classical "fountain coding" setting. 

• The representation of the probability models implicitly assumed in various state-of-the-art
source coders in terms of (conditional) Markov chains is a problem of
independent interest. We did not investigate in detail other image coders
or coders for different kinds of sources, but we believe that an approach similar
to that taken here can be applied to a variety of source coders. In particular,
Markovian models for state-of-the-art speech/audio and video coders may lead to
very interesting practical applications.

• The restriction to BIOS channels is made here essentially for simplicity. It is clear 
that the output of the binary joint-source channel encoder can be concatenated with 
any suitable modulator in order to drive a non-binary input channel, as commonly 
done in a "coded modulation" approach. In particular, the binary multistage encod- 
ing advocated here to encode the bit-planes can be married to superposition coding: 
instead of sending the encoded bit-planes in sequence, these can be modulated and 
superimposed. This approach may lead to efficient schemes for multicasting a common
source to several users over Gaussian channels in different SNR (or fading)
conditions [81], [82], [83], [84], [85].

• From our simulations it appears that the type of reconstruction "noise" achieved by 
the proposed scheme is very different from the typical effect of non-perfect recon- 
struction at the output of standard source coders. In fact, in our case it appears to 
be much closer to some additive independent noise. Hence, our scheme appears to 
be ideally suited to be concatenated with the DUDE algorithm for denoising [80] . 
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APPENDIX 



A Proof of Theorem 1 

Consider a memoryless stationary BIOS channel with input alphabet $\mathbb{F}_2$, output alphabet
$\mathcal{Y}$, transition probability $P_{Y|X}(y|x)$, and capacity $C$. Consider also an arbitrary binary
source $\mathcal{V}$, defined by the sequence of $K$-th order joint probability distributions $P_V^{(K)}(\mathbf{v})$
over $\mathbb{F}_2^K$, for $K = 1, 2, \ldots$. Notice that these probabilities need not have any structure,
and the source need not be stationary or ergodic. The sup-entropy rate $H(\mathcal{V})$ is defined
in [SB] as the limsup in probability of the sequence of random variables $-\frac{1}{K} \log_2 P_V^{(K)}(\mathbf{v})$
for $K \to \infty$. Theorem 1 is a simple corollary of the following results.

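Let $\mathbf{b} \in \mathbb{F}_2^B$ be a binary vector and $\mathbf{G}_1 \in \mathbb{F}_2^{B \times N}$ a binary matrix. Consider the linear
encoding rule $\mathbf{x} = \mathbf{b}\mathbf{G}_1$ and some decoding rule $\psi_1 : \mathcal{Y}^N \to \mathbb{F}_2^B$. We define the conditional
decoding error probability as

$$P(e_1 | \mathbf{b}) = P(\psi_1(\mathbf{y}) \neq \mathbf{b} | \mathbf{x} = \mathbf{b}\mathbf{G}_1) \qquad (25)$$

Then, we have [12, Problem 11, p. 114]

Lemma 1. For every $\epsilon, \delta > 0$ and sufficiently large $N$ there exist $\mathbf{G}_1$ and $\psi_1$ such that
$P(e_1 | \mathbf{b}) < \epsilon$, $\forall \mathbf{b} \in \mathbb{F}_2^B$, with $\frac{B}{N} > C - \delta$. □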

We stated Lemma 1 in terms of the conditional error probability which, due to the uniform
error property of linear codes, is independent of the information message and therefore
coincides with the average and maximal error probabilities. In particular, we let $\psi_1$ be the
Maximum Likelihood decoding rule, defined by

$$\psi_1(\mathbf{y}) = \arg\max_{\mathbf{b}' \in \mathbb{F}_2^B} P(\mathbf{y} | \mathbf{b}'\mathbf{G}_1) \qquad (26)$$

Notice that this rule ignores the a priori probability of the information messages. Further- 
more, the resulting average error probability does not depend on this a priori probability. 

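Let $\mathbf{v} \sim P_V^{(K)}$ be a vector of length $K$ generated by the arbitrary binary source $\mathcal{V}$, and
let $\mathbf{G}_2 \in \mathbb{F}_2^{K \times B}$ be a binary matrix. Consider the linear encoding rule $\mathbf{b} = \mathbf{v}\mathbf{G}_2$ and some
suitable decoding rule $\psi_2 : \mathbb{F}_2^B \to \mathbb{F}_2^K$. We define the average decoding error probability as

$$P(e_2) = \sum_{\mathbf{v} \in \mathbb{F}_2^K} P(\psi_2(\mathbf{b}) \neq \mathbf{v} | \mathbf{b} = \mathbf{v}\mathbf{G}_2) \, P_V^{(K)}(\mathbf{v}) \qquad (27)$$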

The following result, which is the key to the proof of Theorem 1, is due to Verdú and
Shamai [13] and generalizes a well-known result for memoryless sources [12, Problem 7,
p. 24] to sources with arbitrary statistics:

Lemma 2. For any $\epsilon, \delta > 0$ and sufficiently large $K$ there exist $\mathbf{G}_2$ and $\psi_2$ such that
$P(e_2) < \epsilon$ with $\frac{B}{K} < H(\mathcal{V}) + \delta$. □

Putting together Lemmas 1 and 2, consider the following linear coding scheme. Fix
$\epsilon_1, \epsilon_2, \delta_1, \delta_2 > 0$ and choose $K$, $B$ and $N$ such that there exist $\mathbf{G}_1$ satisfying Lemma 1
with $\epsilon_1, \delta_1$ and $\mathbf{G}_2$ satisfying Lemma 2 with $\epsilon_2, \delta_2$. We map linearly the source sequence
$\mathbf{v}$ of length $K$ into the channel codeword $\mathbf{x} = \mathbf{v}\mathbf{G}_2\mathbf{G}_1$ of length $N$. At the receiver, we
use a concatenated decoder based on $\psi_1$ and $\psi_2$, that is, we let $\hat{\mathbf{v}} = \psi_2(\psi_1(\mathbf{y}))$. Notice
that $\psi_1(\cdot)$ is applied to $\mathbf{y}$ by ignoring the prior probability on $\mathbf{v}\mathbf{G}_2$ induced by the source
distribution, and $\psi_2(\cdot)$ is applied to the inner decoding output $\psi_1(\mathbf{y})$ by ignoring the
actual channel output $\mathbf{y}$. This corresponds to separated decoding and it is, evidently,
suboptimal. The error event $e$ of the concatenated decoder is contained in $e_1 \cup e_2$, where
$e_1$ and $e_2$ are the error events of the inner and outer decoders, respectively. By the union
bound we have

$$P(e) \leq P(e_1) + P(e_2) < \epsilon_1 + \epsilon_2 = \epsilon \qquad (28)$$

The resulting efficiency is given by

$$\eta = \frac{K}{N} > \frac{C - \delta_1}{H(\mathcal{V}) + \delta_2} > \frac{C}{H(\mathcal{V})} - \delta \qquad (29)$$

for some $\delta > 0$. Clearly, both $\epsilon$ and $\delta$ vanish as $\epsilon_1, \epsilon_2, \delta_1$ and $\delta_2$ vanish. This proves
Theorem 1.
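
The efficiency bound can be made explicit by splitting $K/N$ into the compression and channel-coding rates guaranteed by Lemmas 1 and 2. The derivation below is a sketch: it writes $H(\mathcal{V})$ for the sup-entropy rate, assumes $H(\mathcal{V}) > 0$ and $\delta_1 < C$, and the closed form for the gap (called $\delta^\star$ here, a symbol introduced only for this illustration) is one convenient choice rather than the authors' stated one.

```latex
\begin{align*}
\eta = \frac{K}{N} = \frac{K}{B}\cdot\frac{B}{N}
  &> \frac{1}{H(\mathcal{V}) + \delta_2}\,\bigl(C - \delta_1\bigr)
   = \frac{C}{H(\mathcal{V})} - \delta^\star, \\
\delta^\star
  &= \frac{C}{H(\mathcal{V})} - \frac{C - \delta_1}{H(\mathcal{V}) + \delta_2}
   = \frac{C\,\delta_2 + H(\mathcal{V})\,\delta_1}{H(\mathcal{V})\,\bigl(H(\mathcal{V}) + \delta_2\bigr)} .
\end{align*}
```

Any $\delta > \delta^\star$ then satisfies the strict bound, and $\delta^\star \to 0$ as $\delta_1, \delta_2 \to 0$, which is the vanishing-gap statement used to conclude the proof.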

Finally, it is clear that the suboptimal concatenated linear scheme with separated two-stage
decoding given above cannot perform better than an optimal linear scheme where
we design directly an encoding matrix $\mathbf{G} \in \mathbb{F}_2^{K \times N}$ (not necessarily given by the product
of a "compression" matrix $\mathbf{G}_2$ and a "channel coding" matrix $\mathbf{G}_1$), and where we use the
optimal Maximum a Posteriori (MAP) decoder instead of the concatenation of $\psi_1$ and
$\psi_2$. For the sake of completeness, the MAP decoder is given by

$$\hat{\mathbf{u}} = \arg\max_{\mathbf{u} \in \mathbb{F}_2^K} P(\mathbf{u} | \mathbf{y}) = \arg\max_{\mathbf{u} \in \mathbb{F}_2^K} P(\mathbf{y} | \mathbf{u}\mathbf{G}) \, P_V^{(K)}(\mathbf{u}) \qquad (30)$$
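
The difference between the MAP rule and the Maximum Likelihood rule of (26) can be seen on a toy example. The sketch below decodes by exhaustive search, which is feasible only for very small $K$; the identity encoding matrix, the BSC cross-over probability and the i.i.d. Bernoulli prior are hypothetical choices made for transparency, and the function names are illustrative.

```python
from itertools import product

def encode(u, G):
    """Linear encoding x = u G over F_2; G is given as a list of K rows."""
    return tuple(sum(u[k] & G[k][n] for k in range(len(u))) % 2
                 for n in range(len(G[0])))

def bsc_likelihood(y, x, q):
    """P(y | x) for a memoryless BSC with cross-over probability q."""
    d = sum(a != b for a, b in zip(y, x))
    return q ** d * (1.0 - q) ** (len(y) - d)

def bernoulli_prior(p1):
    """Prior of an i.i.d. source emitting ones with probability p1."""
    return lambda u: p1 ** sum(u) * (1.0 - p1) ** (len(u) - sum(u))

def decode(y, G, q, prior=None):
    """Exhaustive MAP decoding; with prior=None it reduces to ML decoding."""
    def score(u):
        w = bsc_likelihood(y, encode(u, G), q)
        return w * prior(u) if prior is not None else w
    return max(product((0, 1), repeat=len(G)), key=score)

G = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
y = (1, 0, 0)
ml = decode(y, G, q=0.1)
map_ = decode(y, G, q=0.1, prior=bernoulli_prior(0.01))
print("ML :", ml)
print("MAP:", map_)
```

With a strongly biased source, the MAP rule attributes the isolated one in the received word to channel noise, whereas the ML rule, which ignores the source statistics, accepts it as data.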
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